Cheat Sheet

Justin Meisenhelter

2022-03-10

Python

Syntax

to continue a command on a new line, use the backslash \

8 * (7 + 6 * 5) \
      + 4 / 3 ** 2 - 1 

self assignment
- a = a + b  <-->  a += b
- a = a - b  <-->  a -= b
- a = a * b  <-->  a *= b
- a = a / b  <-->  a /= b

import libraries
import math
from math import sin, cos
assign an alias for the module name
import math as ma

Magic Commands

autoreload

%load_ext autoreload
%autoreload 2

Jupyter NB Shortcuts

group comment: ctrl + /
group indent: ctrl + [
show function documentation: shift + tab inside a function
auto-complete: tab

Mathematical Operators

print(12 / 3)     # int / int -> float
print(17 // 3)    # floor division
print(17.0 // 3)  # return float if one of the values is float 
print(17 % 3)     # remainder
print(2 ** 7)     # 2 to the power of 7

Indexing

Indexing a String

s[start_position:end_position:delta]

word = 'abcdefghijk'
word[-1:-3:-1]
## 'kj'

Reversing a string

word[::-1]
## 'kjihgfedcba'

Iterators

x = [1,2,5]
x_iter = iter(x)
x_iter
next(x_iter)

iter() converts a list into an iterator object. The next function returns the first element and advances the iterator; calling next an additional time returns the second element, and so on.
x.pop() will return the last element of a list and delete it.
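
For example:

x = [1, 2, 5]
x_iter = iter(x)
next(x_iter)
## 1
next(x_iter)
## 2
x.pop()
## 5
x
## [1, 2]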

enumerate

the enumerate() function takes an iterable and outputs an (index, element) tuple pair for each element

words = [0,2,4,6,8]
list(enumerate(words))
## [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]

String Methods

https://www.w3schools.com/python/python_ref_string.asp

word.upper()
word.lower()
word.strip()
word.replace()
word.split()
word.join()
word.find()

String Special Characters

\t <tab>
\n <newline>
r in front of a string makes it a raw string
print(r'C:\some\name')

Functional Operators

map

  • You will often want to apply a function to all elements of a list.
  • Suppose a list L has elements that are all numbers, and f is a function on numbers. Then map(f, L) is the result of applying f to every element of L.
  • The map function returns a map object, which is an iterator; to get the list, use the list function.
  • map(<function>, <iterable>) will apply the function to each element of the iterable and return a map object.
L = [4, 9, 16]
map(math.sqrt, L)

can use lambda functions as well
list(map(lambda val: val**.5, L)) will take the square root of all elements of L

filter

Filter extracts the elements of a list that satisfy some condition. The condition is given in the form of a boolean function.

The filter function also returns an iterator object. Use list to get the associated list.

to call filter we must pass a boolean function, i.e. a function that returns True or False

def non_zero(x):
    return x != 0
list(filter(non_zero, [1, 2, 0, -3, 0, -5]))

will return all non zero elements of the list

Multiple Arguments

list(map(lambda x, y: x+y, [1,2,3], [10,20,30]))

will return the sum of element n of both lists

zip

takes two lists of the same length and makes one list containing pairs of the corresponding elements of the two lists

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
list(zip(a, b))
# [(1, 5), (2, 6), (3, 7), (4, 8)]
list(map(lambda p: p[0]+p[1], zip([1,2,3], [10,20,30])))
# [11, 22, 33]

The operator module

If you want to apply a built-in operation like + in a map, you can write a lambda function - lambda x, y: x+y - but there is a simpler way. The operator module defines named versions of the built-in operations, just for use in functions like map. For example:

from operator import *
print(list(map(add, [1,2,3], [4, 5, 6])))
print(list(map(mul, [1,2,3], [4, 5, 6])))
# [5,7,9]
# [4, 10, 18]

Methods in map calls

to call a method in a map call use a lambda function

lis = [" abc \n", " def", "ghi   "]
list(map(lambda s: s.strip(), lis))

or call the method through the class: list(map(str.strip, lis))

Printing

Format Specifier

  • format_specifier is a string containing special format symbols, which are used to insert values from the tuple:
    • %s means inserting a string.
    • %d means inserting an integer.
    • %f means inserting a float.
    print("hello, %f!" % 2017)
    print("hello, %d!" % 2017)
    print("hello, %.2f!" % 2017.1314)

    format function

print('My name is {0} and I am {1} years old'.format('Mike', 25))

put index positions in curly braces, then pass the values to .format() in the matching order

print('My name is {name} and I am {age} years old'.format(name='Mike', age=25))

f-strings

put an f in front of your string and curly braces around variable names

name = 'justin'
age = 41
f'my name is {name} and my age is {age}'
## 'my name is justin and my age is 41'

String Formatting with the % Operator

print( "hello, %f!" %2017)
## hello, 2017.000000!
print( "hello, %i!" %2017)
## hello, 2017!
print( "hello, %.2f!" %2017)
## hello, 2017.00!
print( "hello, %s!" %'2017')
## hello, 2017!

Functions

def square(x):  
    return x**2

Default Arguments

def polynomial(x, a, b=0, c=0, d=0):  #assigning default values
    result = a + b * x**1 + c * x**2 + d * x**3
    return result
def ask_ok(prompt, retries=4, complaint='Yes or no, please!'):
    while True:
        ok = input(prompt)   # In Python 2, it is called raw_input()
        if ok in ['y', 'ye', 'yes']:
            return True
        if ok in ['n', 'no', 'nop', 'nope']:
            return False
        retries = retries - 1
        if retries < 0:
            raise IOError('Not a good user.')
        print(complaint)

Note: it is invalid in Python to put a keyword argument in front of a positional argument.
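
For example (f is just an illustrative function):

def f(a, b):
    return a - b
f(1, b=2)     # positional then keyword: valid
## -1
# f(a=1, 2)   # SyntaxError: positional argument follows keyword argument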

*args and **kwargs

  • Sometimes your function has a number of input parameters but you don’t want to specify all of them at the definition.
  • Then you can use args (tuple of arguments) and kwargs (keyword arguments) when you define the function.
args = (2,10,2)
list(range(*args))
## [2, 4, 6, 8]
kwargs = {'arg1':5, 'arg2':10, 'arg3': -2}
def some_function(arg1, arg2, arg3):
  return arg1*arg2/arg3
some_function(**kwargs)
## -25.0
def register_student(name, *args, **kwargs):
    print("-- The student's name is ", name)
    print("-" * 40)
    print(type(args))
    for arg in args:
        print(arg)
    print("-" * 40)
    print(type(kwargs))
    keys = sorted(kwargs.keys())
    for kw in keys:
        print(kw, ":", kwargs[kw])

Unzipping a list

nested_list = [[1,2,3], [4,5,6], [7,8,9]]
# nested list is a zipped list
list(zip(*nested_list))
## [(1, 4, 7), (2, 5, 8), (3, 6, 9)]

Lambda Functions

Add_two_powers = lambda x: x**2 + x**3

syntax: lambda <arguments>: <return expression>
Lambda functions can contain if else statements


Firstelt = lambda L: None if L==[] else L[0]
L1=[]
L2=[1,2,3]
print(Firstelt(L1))
print(Firstelt(L2))

Control Flow

Else If

a=6
if type(a)==str:
    print('it is a string')
else:
    if a%2==0:
        print('it is an even number')
    else:
        print('it is an odd number')
## it is an even number
if type(a)==str:
    print('it is a string')
elif a%2==0:
    print('it is an even number')
else:
    print('it is an odd number')
## it is an even number

While Loops

x = 1
while x <= 10:
    if x % 2 == 0 and x % 3 != 0:
        print (x)
    x = x + 1
## 2
## 4
## 8
## 10

break and continue

L = [10, -10, 20, -20, 30, -30, 40, -40, 50, -50, 60, -60]

sum_ = 0
for x in L:
    if x < 0:
# continue skips the rest of the loop body and jumps to the next iteration
        continue
    sum_ = sum_ + x
    if sum_ > 100:
# break exits the innermost enclosing loop
        break
        
sum_
## 150

for loops

for x in range(11):
    if x % 2 == 0 and x % 3 != 0:
        print (x)
## 2
## 4
## 8
## 10

multiple iterables

words = ['a', 'b', 'c', 'd', 'e']
for i, e in enumerate(words):
    print(i, e)

each (index, element) tuple from enumerate is unpacked into the two loop variables

Unpacking

def square_all(x, y):
    return x**2, y**2
x, y = square_all(2,3)
x, y
## (4, 9)

will assign the first output value to x and the second output value to y

Handling Errors/Exceptions

Python’s try and except can provide ways to handle exceptions.

Exceptions raised by statements in the body of try are handled by the except statement and execution continues with the body of the except statement.

def divide(x, y):
    try:
        result =  x / y
    except ZeroDivisionError:
        result = 'INFINITY'
    except TypeError:
        result = divide(float(x), float(y))
    return result

Lists

Defining Lists

x=[1,2,3,4,'1','a','hello']
print(x)
## [1, 2, 3, 4, '1', 'a', 'hello']
#create an empty list
y = list()

List Methods

x.index('hello')
## 6
x.append(55)
x.extend([1,2,3])
# will add all elements in the list to the object x
x.insert(2,'d')
# inserts 'd' into the index of 2

List Sorting

staff =[['Lucy','A',9], ['John','B',3], ['Peter','A',6]]
sorted(staff, key = lambda x: x[1]) # key is the sort metric
## [['Lucy', 'A', 9], ['Peter', 'A', 6], ['John', 'B', 3]]
staff =[['Lucy','A',9], ['John','B',3], ['Peter','A',6]]
sorted(staff, key = lambda x: x[2]+len(x[0]))  # key is the sort metric
# will sort by the third element plus the length of the name
## [['John', 'B', 3], ['Peter', 'A', 6], ['Lucy', 'A', 9]]

Exercises

Write a function to switch the ith and jth items in a list.

def switch_item(L, i, j):
    ... function body goes here ...

my_list = ['first', 'second', 'third', 'fourth']
switch_item(my_list, 1, -1)
my_list ---> ['first', 'fourth', 'third', 'second']
def switch_item(L, i, j):
    temp = L[i]
    L[i] = L[j]
    L[j] = temp
    return L
my_list = ['first', 'second', 'third', 'fourth']
switch_item(my_list, 1, -1)
## ['first', 'fourth', 'third', 'second']

List comprehension

the format of a list comprehension

[ <return expression> for <element> in <list> if <boolean> ]
[ x* x for x in [1, 2, 3, 4, 5] if x%2 == 0] #map and filter

if you need an else statement

[ <expr1> if <boolean> else <expr2> for <element> in <list> ]
[ x* x if x%2 == 0 else x+2 for x in [1, 2, 3, 4, 5] ]

can chain together if elses

[x*x if x%2 == 0 else x+2 if x < 3 else x*5 for x in [1, 2, 3, 4, 5]]

can pair each element with the previous one by zipping the list against itself shifted by one

[x + p for p, x in zip([1,2,3,4,5], [1,2,3,4,5][1:])]
x=[200,400,23,50,1,23,54,66,88,97,20]
y=[i for i in x if i%2==0]
y
## [200, 400, 50, 54, 66, 88, 20]

Write a list comprehension that takes a list of strings and returns the capitalized version of those strings if they are greater than 2 and fewer than 12 characters in length.

names=['Barry White','aj','tom','John Foster Wallace',
       'Billy Bud','Yifan Ghost Song']
[i.upper() if len(i) > 2 and len(i) <12 else i for i in names]
## ['BARRY WHITE', 'aj', 'TOM', 'John Foster Wallace', 'BILLY BUD', 'Yifan Ghost Song']

Dictionary Comprehension

dict_list = ['a', 'b', 'd', 'z']
{ letter:ord(letter) for letter in dict_list }
## {'a': 97, 'b': 98, 'd': 100, 'z': 122}

Tuple

Tuples are similar to lists in many ways, except they can’t be modified once you’ve created them.

Tuples are defined between two parentheses instead of brackets.

Tuples are iterable but immutable.

tup=("hi", True, 1.0, 2)
tup
## ('hi', True, 1.0, 2)
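
Attempting to modify a tuple raises an error:

tup[0] = 'bye'
## TypeError: 'tuple' object does not support item assignment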

Set

A set is an unordered sequence of unique values.

Creating an empty set

z=set() #creating an empty set
z
## set()

Set Function

s={1,2,3,3,3,4}
t={8,4,6,2}
s.difference(t)
## {1, 3}
s-t
## {1, 3}
s.intersection(t)
## {2, 4}
s&t
## {2, 4}
s.union(t)
## {1, 2, 3, 4, 6, 8}
s|t
## {1, 2, 3, 4, 6, 8}
s^t
# symmetric difference: all items not shared by both sets
## {1, 3, 6, 8}

Dictionary

A dictionary is a set of keys with associated values. The key must be hashable, but the value can be any object. It is also known as a hash map, hash table or set of key-value pairs.

thing={}
thing['key1']='val1'
thing[2]='val2'
thing[3]=[3,2,1]

Can construct a dictionary from a list or set of tuples

dict([('sape', 4139), ('guido', 4127), ('jack', 4098)])

Dictionary functions

thing.keys()
## dict_keys(['key1', 2, 3])
thing.values()
## dict_values(['val1', 'val2', [3, 2, 1]])
thing.items()
## dict_items([('key1', 'val1'), (2, 'val2'), (3, [3, 2, 1])])
thing.get('weight')
# thing.get(<key>, <default>): returns None (or the given default) if the key does not exist
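
For example:

thing.get('weight')     # key does not exist, returns None
thing.get('weight', 0)  # returns the supplied default instead
## 0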

Word Frequency Histogram

words=['hi','hello','hi','hi','yo','whazzup','hey','hey']
hist={}
for i in words:
    if i in hist:
        hist[i] += 1
    else:
        hist[i] = 1
hist
## {'hi': 3, 'hello': 1, 'yo': 1, 'whazzup': 1, 'hey': 2}

Counter

from collections import Counter, defaultdict
sentence = "this is a sentence and it has a bunch of letters and I want to count a few things"
my_counter = Counter(sentence)
my_counter
## Counter({' ': 18, 't': 9, 'n': 8, 'a': 7, 's': 6, 'e': 6, 'h': 4, 'i': 4, 'c': 3, 'o': 3, 'd': 2, 'u': 2, 'f': 2, 'w': 2, 'b': 1, 'l': 1, 'r': 1, 'I': 1, 'g': 1})

Bitwise Operator

& and | are the bitwise AND and OR operators, respectively

10 & 1
# n & 1 is a simple even/odd test: it returns 0 when n is even, 1 when n is odd
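
For example:

print(10 & 1)
## 0
print(7 & 1)
## 1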

Recursion

A function which calls itself

def factorial(num):
    prod = 1
    while num > 0:
        prod *= num
        num -= 1
    return prod
# for a recursive function we need a base case and a recusive call
# with recursion
def factorial_rec(num):
# base case
    if num == 0:
        return 1
# recursive call
    return num * factorial_rec(num-1)

Yield Statement

The yield statement is similar to return, however, it won’t return the whole result (say a very large list) at the end of the function but a generator instead.

def ret(L):
    res = list(map(lambda x: x ** 2, L))
    return res
    
def gen(L):
    for i in L:
        yield i ** 2
# gives us a generator object

we now have a generator and can use the next function to produce successive results
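
For example, using gen from above:

squares = gen([1, 2, 3])
next(squares)
## 1
next(squares)
## 4
list(squares)   # consumes whatever remains
## [9]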

Web Scraping

example for the following sections:

<html>
    <head>
        <title>Hi</title>
    </head>
    <body>
        <a href="https://www.crummy.com/software/BeautifulSoup/">Hello, BeautifulSoup!</a>
    </body>
</html>

with regular expressions, parse a file (hi.html) and capture the text that appears between <title> and </title>

import re
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    print(re.findall('<title>(.*)</title>', hi))

with beautiful soup

from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # find the value of tag

outputs a beautiful soup object

Names, Values, and Attributes

BeautifulSoup can extract the name, value, and attributes of tags. The corresponding attributes are:
- name
- string
- attrs

print("The name of a tags is: ", hi.a.name)
print("The value of a tags is: ", hi.a.string)
print("The attribute of a tags is: ", hi.a.attrs)

get_text() & get()

The get_text() method will extract the values of the current tag and all its child tags:

print(hi.html.get_text())

outputs:

Hi
Hello, BeautifulSoup!

  • The get() method is used to extract the value of a specific attribute of a tag. For example, we can get the href of the a tag using the following code.

  • This is the same as running hi.a.attrs first and then finding the value of key href from the attrs dictionary.

print(hi.a.get('href'))

output: https://www.crummy.com/software/BeautifulSoup/

example for the following sections:

<html>
    <head>
        <title>Article</title>
    </head>
    <body>
        <h1 id='one'>This is an h1 tag.</h1>
        <p>This is a paragraph, standard size font.</p>
        <h2>This is an h2 tag, smaller font.</h2>
        <p><a href='https://www.google.com'>This is a link to google.</a></p>
        <h3><a href='https://www.yelp.com/'>This is an h3 tag, even smaller.</a></h3>
        <p>And a normal p tag.</p>
    </body>
</html>

create a BeautifulSoup object from the article.html

article_path = 'data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

find()

  • find() returns the first p tag, which is equivalent to article.p:
print(article.find('p'))
outputs:

<p>This is a paragraph, standard size font.</p>

  • find_all() returns all descendant p tags as a resultset which is a subclass of list:
print(article.find_all('p'))

output: [<p>This is a paragraph, standard size font.</p>, <p><a href="https://www.google.com">This is a link to google.</a></p>, <p>And a normal p tag.</p>]

  • To find the tags that have specific attributes, you can pass a dictionary to the attrs argument with the key being the attribute name and the value being the value of the attribute:
print(article.find_all('h1', attrs  = {'id':'one'}))

output: [<h1 id="one">This is an h1 tag.</h1>]

example: The tags whose attribute id == ‘one’

print(article.find_all(lambda tag: tag.get('id') == 'one'))

output: [<h1 id="one">This is an h1 tag.</h1>]

Web Scraping Example Books to Scrape

The general workflow of a web scraping project is the following:

  1. Identify which fields we want to scrape. If this means we need to visit different URLs to get to the final page with our relevant information, how do we get those URLs?
     - For this project, we're interested in the book title, UPC, price, genre, average rating, whether it's in stock, and how many books are available.
  2. Find the unique attributes that will locate the tags of the elements that we're interested in.
     - What elements are we interested in on the home page? How do we get every element from each search page? What elements do we want on the book pages?
  3. If there are multiple relevant pages in the project, test on the first page to make sure your code is extracting the correct elements.
     - On the first results page, we can try to get all of the href attributes from each book product. We should get 20 total because there are 20 books per page.
  4. Once we know that our code is extracting the correct data, we can generalize the code to apply the scraping operations to every page.
  5. The final data (the book information) should be stored in a dict. This will make saving the data easier.
  6. Once all the dict objects are populated, we can save the data locally as a JSON or CSV file (or other formats such as saving to a SQL database or pickle file).


from bs4 import BeautifulSoup
# requests retrieves a page's source code via an HTTP GET request
import requests
# the user agent identifies the software used to access the site
# google "what is my user agent" and copy/paste the result
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
# check the response, 200 is good to go
response = requests.get('https://books.toscrape.com/', headers = headers)
text = BeautifulSoup(response.text, 'html.parser')

We can see that each book product has a div tag with class attribute “image_container.” This tag has a child a tag that has an href attribute. The href attribute value is the URL to each of our book pages.

# We can check the length of the object that is returned to
# (loosely) confirm that we are getting the correct content
len(text.find_all('div', attrs = {'class': 'image_container'}))

# get the child a tags and subsequent href values
# start with the location of the div tags
book_divs = text.find_all('div', attrs = {'class': 'image_container'})
# use list comprehension to iterate over each tag to find the child a tag and get the href attribute
book_urls = [tag.find('a').get('href') for tag in book_divs]
print(book_urls[:5])
# results are missing the domain
# concat the domain url to book_urls
complete_book_urls = [f'https://books.toscrape.com/{end_part}' for end_part in book_urls]
print(complete_book_urls[:5])

Now we have complete URLs that we can pass to our requests.get() function to retrieve the source code for these book pages. How do we get the URLs for all the books on the 50 pages of book results?
Step 3: Find the URL Pattern
If we click on the “next” button on the bottom right of the page, we’ll see a small change in the URL in our address bar.

Hopefully, you can see the pattern here. After the first page, we have an additional “catalogue/page-X.html” part added to the URL. In fact, if we add this part to the first URL, we should get the first page if we change it to “page-1.”

Try it out: https://books.toscrape.com/catalogue/page-1.html

If we are visiting multiple pages, it is a requirement that the URL changes as well. If we click the “next” button and we see that the URL does not change, we would probably need to use another web scraping tool called Selenium which is able to scrape dynamic web elements.

Our goal now is to generate a URL for every results page. Since we have identified the pattern, and we can see how many pages there are in total at the bottom of the screen, this should be a simple operation using list comprehension.

results_pages = [f'https://books.toscrape.com/catalogue/page-{i}.html' for i in range(1,51)]
print(results_pages[:5])

Now that we have all of the URLs for the different results pages, and we have the code to find all the book page URLs for any given page, let’s now move on to getting the data from each book page.

Let’s create another BeautifulSoup object using the source code from a book page. We can use the first book on the first page as an example for testing.

response = requests.get('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
                        headers = headers)
text = BeautifulSoup(response.text, 'html.parser')

Remember, the fields we are interested in scraping are:

  • title
  • price
  • average rating
  • genre
  • UPC
  • how many books available
  • whether the book is in stock
# Book title
title = text.find('h1').string
print(title)

# The above only works if this is the first or only h1 tag 
# in the entire source code. We can be safer by including
# the parent div tag with its identifying attribute.
title = text.find('div', attrs = {'class': 'col-sm-6 product_main'}).find('h1').string
print(title)
# Book price
price_in_pounds = text.find('p', attrs = {'class':'price_color'}).string
print(price_in_pounds)

# Note that there is a strange character before the pounds symbol.
# We will discuss how to handle this later.
# Average rating
# This one is a little tricky. Where can we find the
# number of stars for the average rating?

# What is this anonymous function doing?
avg_rating_tag = text.find(lambda tag: 'star-rating' in tag.get('class') if tag.get('class') else False)
avg_rating = avg_rating_tag.get('class')
print(avg_rating)

avg_rating = avg_rating[1]
print(avg_rating)

# Note that this is still in string format. We will handle this later.
# Genre
# Where is the genre of the book located on the page?

li_tag = text.find('ul', attrs = {'class':'breadcrumb'}).find_all('li')[2]
genre = li_tag.find('a').string
print(genre)
# UPC
# What is the benefit of scraping the UPC code?

# Why is there no tbody tag between the 'table' and 'tr' tags?
# chrome will sometimes add a tbody tag in the inspect view for tables, it is not actually in the source code
tr_tag = text.find('table', attrs = {'class':'table table-striped'}).find_all('tr')[0]
upc = tr_tag.find('td').string
print(upc)
# How many books are available?
# There are two places where this information is available,
# the tag is easier to identify in the area by the book title.

num_books_available = text.find('p', attrs = {'class':'instock availability'}).get_text()
print(num_books_available)

Step 5: Bringing the Code Together

We now have all of the code necessary to scrape all the book page information for every book listed on every results page. Let’s think about how our code should traverse these different levels.

  1. Generate a list of results page URLs.
  2. Create an empty list to hold the book page URLs.
  3. Get the source code for every results page and scrape all of the book page URLs, add these URLs to the URL list (after adding the domain to each URL).
  4. After we have all the book page URLs, get the source code for each book page and scrape the relevant information.
  5. Store the information in an appropriate data structure.
  6. We either save these data structures in a list, or we can write the book information to a CSV file (or use a number of other options for persisting the data).
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
            }

# Generate the results page URLs
results_pages = [f'https://books.toscrape.com/catalogue/page-{i}.html' for i in range(1,51)]

# Create empty list to hold book page URLs
book_page_urls = []

# For each results page, scrape the URLs for each book page
for url in results_pages:
    response = requests.get(url, headers=headers)
    text = BeautifulSoup(response.text, 'html.parser')
    
    if response.status_code != 200:
        raise Exception(f'The status code is not 200! It is {response.status_code}.')

    # Get the book page URLs and add the domain, then add to book page URL list
    book_divs = text.find_all('div', attrs={'class': 'image_container'})
    book_urls = [tag.find('a').get('href') for tag in book_divs]
    complete_book_urls = [f'https://books.toscrape.com/catalogue/{end_part}' for end_part in book_urls]
    # What is the difference between append and extend?
    book_page_urls.extend(complete_book_urls)

Note: There is another way to approach this process. We could scrape the book page URLs on one results page and then visit every book page to get the book information before moving on to the next results page. Either way is fine, but I prefer to get all the links first and then visit them later.

Let’s now visit each book page and use the code that we tested earlier to get all of the book information.

# Create an empty list to hold our book information
book_list = []

# Let's just get the first 100 so this doesn't take very long
for url in book_page_urls[:100]:
    # We'll store our book information in dictionaries so
    # we should create a new dict for every iteration
    book_dict = {}
    
    response = requests.get(url, headers = headers)
    text = BeautifulSoup(response.text, 'html.parser')
    
    if response.status_code != 200:
        raise Exception(f'The status code is not 200! It is {response.status_code}.')
    
    # Get title
    title = text.find('div', attrs = {'class': 'col-sm-6 product_main'}).find('h1').string
    # Get price
    price_in_pounds = text.find('p', attrs = {'class':'price_color'}).string
    # Get average rating
    avg_rating_tag = text.find(lambda tag: 'star-rating' in tag.get('class') if tag.get('class') else False)
    avg_rating = avg_rating_tag.get('class')[1]
    # Get genre
    li_tag = text.find('ul', attrs={'class':'breadcrumb'}).find_all('li')[2]
    genre = li_tag.find('a').string
    # Get UPC
    tr_tag = text.find('table', attrs = {'class':'table table-striped'}).find_all('tr')[0]
    upc = tr_tag.find('td').string
    # Get number of books available
    num_books_available = text.find('p', attrs = {'class':'instock availability'}).get_text()
    
    # Store info in the dictionary
    book_dict['title'] = title
    book_dict['price_in_pounds'] = price_in_pounds
    book_dict['avg_rating'] = avg_rating
    book_dict['genre'] = genre
    book_dict['upc'] = upc
    book_dict['num_books_available'] = num_books_available
    
    # Add dictionary to list
    book_list.append(book_dict)
book_list[:5]

Step 6: Persist the Data
Now that we have all of our book information, we need to decide on a way to save our data. There are a few different ways we can save this data:

  1. Save to CSV
  2. Save to JSON
  3. Save to pickle

There are advantages and disadvantages for each format. Let’s talk about them.

CSV
Saving data as a CSV file is probably the most common strategy. CSV (Comma Separated Value) files are a very standard file format (think Excel spreadsheet). We can even load them directly into a pandas DataFrame.

We can use the csv module to help us save the data to CSV format. Here is the documentation.

import csv

# Create a new CSV file
with open('books.csv', 'w', encoding = 'utf-8', newline='') as csvfile:
    # Create the CSV writer object
    book_writer = csv.writer(csvfile)
    # Write the initial header row
    book_writer.writerow(['title', 'price_in_pounds', 'avg_rating', 'genre', 'upc', 'num_books_available'])
    
    # Iterate over the list of dicts and write the values to CSV
    for book_dict in book_list:
        book_writer.writerow(book_dict.values())

JSON

JSON (JavaScript Object Notation) is another file format we can use. You mostly encounter JSON data when working with JavaScript or retrieving data from an API. JSON files look like a list of dictionaries. Often these dictionaries will have nested dictionaries as values. It’s good to know how to work with this file format. Here is the documentation.

import json

# This will create a string version of the JSON
# Note the change in encoding for the pound symbol
json.dumps(book_list)

# This will save the JSON to a file
with open('books.json', 'w', encoding = 'utf-8') as jsonfile:
    json.dump(book_list, jsonfile, indent=4)
    
# To load the JSON back, use the json.load() function
with open('books.json', 'r', encoding = 'utf-8') as jsonfile:
    loaded_books = json.load(jsonfile)

Pickle

Pickle is a way of saving any type of Python object to a file. We can then load this file back into an environment to regain access to the exact same object (in the same state). This is very useful when working with objects that took a long time to create (like a trained machine learning model) or when saving objects in Python that aren’t as obvious to convert into another format. Here is the documentation.

import pickle

# This will save the object as a pickle
# Note the wb mode here
with open('books.pickle', 'wb') as picklefile:
    pickle.dump(book_list, picklefile)

# This will load the pickle file
# Note the rb mode here
with open('books.pickle', 'rb') as picklefile:
    loaded_books = pickle.load(picklefile)

Arrays

The ndarray type is flexible because it supports a great variety of primitive data types, subject to a homogeneity condition: every element in an ndarray has the same type.

import numpy as np
nested_lst1 = [[-1,2,1,2],[5,-6,5,6],[7,8,-7,8]]
nested_lst2 = [[1,2,1,2],[5,6,5,6],[7,8,7,8]]
multi_ary1 = np.array(nested_lst1)
multi_ary2 = np.array(nested_lst2)
higher_dim = np.array([multi_ary1, multi_ary2])
higher_dim
## array([[[-1,  2,  1,  2],
##         [ 5, -6,  5,  6],
##         [ 7,  8, -7,  8]],
## 
##        [[ 1,  2,  1,  2],
##         [ 5,  6,  5,  6],
##         [ 7,  8,  7,  8]]])

Array Functions

np.arange(10)
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(2,10) #start, #stop
## array([2, 3, 4, 5, 6, 7, 8, 9])
np.arange(2,10,3) #start, #stop, #step
## array([2, 5, 8])
np.linspace(0,10, 51) #(start, end, no of elements)
## array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ,
##         2.2,  2.4,  2.6,  2.8,  3. ,  3.2,  3.4,  3.6,  3.8,  4. ,  4.2,
##         4.4,  4.6,  4.8,  5. ,  5.2,  5.4,  5.6,  5.8,  6. ,  6.2,  6.4,
##         6.6,  6.8,  7. ,  7.2,  7.4,  7.6,  7.8,  8. ,  8.2,  8.4,  8.6,
##         8.8,  9. ,  9.2,  9.4,  9.6,  9.8, 10. ])
np.ones((3,2)) #(array of ones of dimension 3x2)
## array([[1., 1.],
##        [1., 1.],
##        [1., 1.]])
np.zeros((4,3,2)) # array of zeros of dimension 4x3x2
## array([[[0., 0.],
##         [0., 0.],
##         [0., 0.]],
## 
##        [[0., 0.],
##         [0., 0.],
##         [0., 0.]],
## 
##        [[0., 0.],
##         [0., 0.],
##         [0., 0.]],
## 
##        [[0., 0.],
##         [0., 0.],
##         [0., 0.]]])
high_x1 = np.array([[1,2,3],[4,5,6]])
high_x1.shape
## (2, 3)
x = np.arange(8)
x.reshape([2,4], order='F') #reshapes array to 2x4 fills by columns
## array([[0, 2, 4, 6],
##        [1, 3, 5, 7]])
x.astype(float) # change datatype of array elements
## array([0., 1., 2., 3., 4., 5., 6., 7.])

Matrix

x = np.matrix([3,2])
print(type(x))
## <class 'numpy.matrix'>
print(x.shape)
## (1, 2)
x
## matrix([[3, 2]])

Matrix Multiplication

The type matrix is for linear algebra, in which the matrix multiplication is always “a row times a column”:

\[ (1,2,3) \times \left( \begin{array}{c} 1\\ 2\\ 3\\ \end{array} \right) = (1+4+9) = 14 \]
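
The same product computed with NumPy matrices:

import numpy as np
row = np.matrix([1, 2, 3])        # 1x3 row vector
col = np.matrix([[1], [2], [3]])  # 3x1 column vector
row * col
## matrix([[14]])

Identity Matrix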

Id = np.matrix(np.eye(3))
Id
## matrix([[1., 0., 0.],
##         [0., 1., 0.],
##         [0., 0., 1.]])

Inverse Matrix

The identity matrix is analogous to the number one. One is referred to as the multiplicative identity, since any number multiplied by one remains unchanged. A nonzero number has a multiplicative inverse, or reciprocal. The product of a number and its multiplicative inverse is the multiplicative identity, one.

For real scalars:

  • nonzero number: n
  • multiplicative identity: 1
  • multiplicative inverse: \(\frac{1}{n}\) \[n \times 1 = n\]

\[n \times \frac{1}{n}=1\]

The analogy for a square matrix is called an inverse matrix. A nonsingular square matrix has a multiplicative inverse, or inverse matrix. The product of a matrix and its inverse matrix (in either order) is the identity matrix.

For real-valued square matrices:

  • nonsingular matrix: \(M\)
  • identity matrix: \(\textbf{1}\)
  • inverse: \(M^{-1}\) \[M \cdot \textbf{1} = M\]

\[M \cdot M^{-1}=\textbf{1}\]

\[M^{-1} \cdot M=\textbf{1}\]
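
A quick numerical check (A is an arbitrary nonsingular matrix chosen for illustration):

import numpy as np
A = np.matrix([[2, 1], [1, 1]])
A * A.I    # a matrix times its inverse gives the identity
## matrix([[1., 0.],
##         [0., 1.]])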

  • How do you find a tuple of x, y and z satisfying all the equations below? \[ \begin{eqnarray} 3x +2y -z &= 1\\ 2x -2y +4z &=-2\\ -x +\frac{1}{2}y- z&= 0 \end{eqnarray} \]

Observe that the equation can be written as:

\[ \begin{pmatrix} 3 &2 &-1 \\ 2 &-2 &4 \\ -1&\frac{1}{2}&-1 \end{pmatrix} \times \begin{pmatrix} x\\y\\z \end{pmatrix} = \begin{pmatrix} 1\\-2\\0 \end{pmatrix} \]

\[\begin{pmatrix} x\\y\\z \end{pmatrix}= \begin{pmatrix} 3 &2 &-1 \\ 2 &-2 &4 \\ -1&\frac{1}{2}&-1 \end{pmatrix}^{-1} \times \begin{pmatrix} 1\\-2\\0 \end{pmatrix}\]

M = np.matrix([[3, 2, -1],[2, -2, 4],[-1, 0.5, -1]])
M2 = np.matrix([[1], [-2], [0]])
M.I * M2
## matrix([[ 1.],
##         [-2.],
##         [-2.]])

Classes

Attributes and Methods

  • We’ll go into details in a bit, but here is the syntax to define a simple class:
class Classname(object):
    def __init__(self):
        initialize representation by assigning to variables

    def methodname(self, ...args...):
        define method; 
        change representation or return value or both; 
        use self.var to refer to variable var defined in __init__.
  • In __init__, we assign the desired representation to one or more variables, so we just have to decide what their names will be.

  • Once we have this class, we can create instances of it:

newobj = Classname()
  • We invoke methods using object notation:
newobj.methodname(...args...)
  • Note that even though we defined the method using ordinary function definition syntax, we must call it using object syntax. That is just because it is defined inside a class.
  • However, in Python, there is an attribute naming convention to denote private attributes by prefixing the attribute with one or two underscores, e.g:
self._coords
self.__coords
  • Person is the class
  • name and age are each an object attribute
  • intro is an object method
class Person():
    '''This is the documentation of the class person ... here'''
    
    def __init__(self,a_name,a_age):#constructor
        self.name=a_name
        self.age=a_age
        
    def intro(self):
        print("My name is",self.name,"and I am",self.age,"years old")

Special Method Names

use the __str__ special method
and the __add__ special method
the second argument in a method definition is usually labeled other

class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def length(self):
        return sum([x**2 for x in self.coords])**.5
    
    def __str__(self):
        return 'Vector' + str(self.coords)
    def __add__(self, other):
        return Vector(list(map(lambda x, y: x+y, self.coords, other.coords)))
    
# Then we print the Vector object:
vec_1 = Vector([1,2,3])
print(vec_1)
## Vector[1, 2, 3]

Exercise

  • Now our Vector class looks like this:
class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def length(self):
        return sum([x**2 for x in self.coords])**.5

    def __add__(self, other):
        return Vector(map(lambda x, y: x+y,
                          self.coords, other.coords))

    def __str__(self):
        return 'Vector'+str(self.coords)
  • Add two more methods to the class:
    __eq__(vec): returns True if this vector equals vec.

  • u == v calls u.__eq__(v).

  • __mul__(vec): returns the dot product of this vector and vec. The dot product is defined by: (x, y, …) * (x’, y’, …) = xx’ + yy’ + …

  • u * v calls u.__mul__(v)

  • Then evaluate the following expressions ( equality and \(cos(\theta)\)):

class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def length(self):
        return sum([x**2 for x in self.coords])**.5

    def __add__(self, other):
        return Vector(list(map(lambda x, y: x+y, self.coords, other.coords)))

    def __str__(self):
        return 'Vector'+str(self.coords)
    
    def __eq__(self, other):
        return list(self.coords) == list(other.coords)
        
    
    def __mul__(self, other):
        return sum(list(map(lambda x,y: x*y, self.coords, other.coords)))
u = Vector([1,1,0])
v = Vector([0,1,1])
print(u == v)                      
## False
print((u*v) / (u.length()*v.length()))
## 0.4999999999999999

Inheritance

a subclass that does not add attributes does not need its own __init__ statement; if omitted, it inherits the parent's

class Book(object):
    def __init__(self, name, author = None):
        self.name = name
        self.author = author
    
    def __str__(self):
        return '<%s> by %s' % (self.name, self.author)
class EBook(Book):
    def __init__(self, name, author = None):
        Book.__init__(self, name, author)

create a mutating method

class Book(object):
    def __init__(self, name, author = None):
        self.name = name
        self.author = author
    def __str__(self):
        return '<%s> by %s' %(self.name, self.author)
    def rename(self, newname):
        if not isinstance(newname, str):
            raise TypeError('please enter a valid str value for new name')
        self.name = newname

book_1 = Book('The little SAS book', 'Lora D. Delwiche')
print("The name of book_1 is originally %s" % book_1)
## The name of book_1 is originally <The little SAS book> by Lora D. Delwiche
book_1.rename('The SAS book')
print("The name of book_1 is now %s" % book_1)
## The name of book_1 is now <The SAS book> by Lora D. Delwiche

Class Method

uses the class itself as an argument in a method to return some object

class Date(object):

    def __init__(self, day = 0, month = 0, year = 0):
        self.day = day
        self.month = month
        self.year = year
        
    
    @classmethod
    def from_string(cls, date_as_string):
        '''
        This function takes a string as input and returns a Date object.
        The input string needs to follow the format of 'dd-mm-yyyy'
        '''
        day, month, year = list(map(int, date_as_string.split('-')))
        date1 = cls(day, month, year)
        return date1
        
# usage
date2 = Date.from_string('23-08-1990')
date2.year
## 1990
  • We’ve implemented date string parsing in one place and it’s reusable now.
  • @classmethod is a special syntax sugar called decorator in Python.
  • Decorator takes a function as an input and returns the modified function as the output, i.e. it converts the from_string function to a class method; see the sketch below.
  • cls is an object that holds the class itself, not an instance of the class. It’s pretty cool because if we inherit our Date class, all children will have from_string defined also.
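
A minimal sketch of a hand-written decorator (shout and greet are hypothetical names used only for illustration):

def shout(func):
    # takes a function as input...
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    # ...and returns the modified function
    return wrapper

@shout
def greet(name):
    return 'hello, %s' % name

greet('mike')
## 'HELLO, MIKE'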

Regular Expressions

https://learnbyexample.github.io/python-regex-cheatsheet/

Meta Characters

.^$*+?{}[]\|()

Import and basic methods

import re
raw_string = 'Hi, how are you today?, Hi'
s = re.search('Hi', raw_string)
print(s.start()) # the starting position of the matched string
print(s.end())   # the ending position index of the matched string
print(s.span())  # a tuple containing the (start, end) positions of the matched string
print(s.group()) # the matched string
# split by common separators
print(re.split('[\n ,.-]+', raw_string)) # split the string into a list by the pattern
# find all letters
print(re.findall('[a-zA-Z]+', raw_string)) # find all substrings where the pattern matches, and return them as a list

dot

. refers to any single character
a. matches any 2 characters that start with a

print(re.search('a.', 'aa') != None)
print(re.search('a.', 'ab') != None)
print(re.search('a.', 'a1') != None)
print(re.search('a.b', 'a+b') != None)
print(re.search('a.b', 'a+x+b') != None)
print(re.search('../../201.', 'From 06/01/2015') != None)
s = re.search('(..)/(..)/(201.)', 'From 06/01/2015')
print(s.group(1))

Question mark, plus, asterisk, and {}

? matches the preceding expression either once or zero times.

+ matches the preceding expression character at least once.

* matches the preceding expression character arbitrary times.

{m,n} matches the preceding expression at least m times and at most n times.

For example, ba?b matches bab and bb.
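
For example:

print(re.search('ba?b', 'bb') != None)        # zero 'a's is fine
## True
print(re.search('ba+b', 'bb') != None)        # + needs at least one 'a'
## False
print(re.search('ba*b', 'baaab') != None)     # * allows any number
## True
print(re.search('ba{2,3}b', 'baab') != None)  # between 2 and 3 'a's
## True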

caret / dollar sign

^ refers to the beginning of a text, while $ refers to the ending of a text.

For example, ^a matches any text that begins with character a.

a$ matches any text ending with character a.
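
For example:

print(re.search('^a', 'abc') != None)
## True
print(re.search('^a', 'bca') != None)
## False
print(re.search('a$', 'bca') != None)
## True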

bracket

[] is used to specify a set of characters that you wish to match. For example, [123abc] will match any of the characters 1, 2, 3, a, b, or c; this is the same as [1-3a-c], which uses a range to express the same set of characters. Furthermore, [a-z] matches all lowercase letters, while [0-9] matches all numbers.
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
Characters that are not within a range can be matched by complementing the set: if the first character of the set is '^', all the characters that are not in the set will be matched.
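
For example:

print(re.search('[123abc]', 'xyz2') != None)
## True
print(re.search('[(+*)]', '3*4') != None)   # '*' is literal inside the set
## True
print(re.search('[^0-9]', '123') != None)   # complemented set: no non-digits here
## False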

vertical bar

| is a logical OR operator. For example, a|b matches a or b, which is similar to [ab]. abc|123 matches abc or 123, while [abc123] matches any single character among a, b, c, 1, 2, 3.
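
For example:

print(re.findall('abc|123', 'abc123xyz'))
## ['abc', '123']
print(re.findall('[abc123]', 'a1z'))
## ['a', '1']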

Special sequence in regular expression

There are some special sequences that have special meaning in regular expression.

\d: Matches any decimal digit; this is equivalent to the class [0-9].

\D: Matches any non-digit character; this is equivalent to the class [^0-9].

\w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

\s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
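
For example:

print(re.findall(r'\d+', 'order 66, aisle 9'))
## ['66', '9']
print(re.findall(r'\w+', 'hi-there!'))
## ['hi', 'there']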

Files

Read file as text

!type <name of file>


Opening Files

f = open('fooz.txt', 'r') 
f.read()
f.close()
# reads entire file as a string
# 'r' opens file for reading
# 'a' opens file for appending
# 'w' opens file for overwriting
# 'r+' opens file for both reading and writing

Reading Files

f = open('fooz.txt', 'r')
lines=f.readlines()
print(lines)
f.close()
# reads each line as list

Iterating Files

f = open('fooz.txt', 'r')
for i in f:
    print(i)
f.close()
# let's capitalize everything in our file (lines from the previous snippet)
text = ''.join(list(map(lambda s: s.upper(), lines)))
print(text)

Write a function to strip the new line characters out of a file

def e_to_a(filename):
    f = open(filename, 'r')
    lines = f.readlines()
    lines = list(map(lambda x: x.strip(), lines))
    f.close()
    return lines

Writing files

f = open('fooz.txt', 'w') # 'w' mode creates/overwrites the file
f.write('this is some text.\nThis is another line.\n')
f.write('oh snap, more lines.\nIt\'s a line.\n')
f.close()

Random Sampling

The numpy.random submodule provides random-variable generators. Some are listed below:

  • randn(d0, d1, ..., dn): Each entry is from the “standard normal” distribution.

  • rand(d0, d1, ..., dn): Random array with a given shape where each element is from Uniform [0, 1].

  • randint(low, high, size): Random integers ranging from low (inclusive) to high (exclusive).

  • random_integers(low, high, size): Random integers between low and high, both inclusive.

  • random_sample(size): Random floats in the half-open interval [0.0, 1.0).

  • choice(a[, size, replace, p]): Generates a random sample from a given 1-D array. By default, replace=True.

import numpy as np
np.random.randn(10)
## array([-0.32274891, -0.1289802 , -0.11841044,  0.1860917 , -1.61890309,
##         0.15605268,  0.59388918,  0.86125404,  0.63192088,  0.15026827])
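
A few of the other generators listed above (outputs vary from run to run, so none are shown):

np.random.rand(2, 3)                    # 2x3 array of Uniform[0, 1) draws
np.random.randint(1, 7, size = 5)       # five integers from 1 to 6 (7 is exclusive)
np.random.choice(['H', 'T'], size = 4)  # four coin flips, sampled with replacement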

Scipy

import numpy as np
from scipy import stats

from sklearn import datasets
iris_raw = datasets.load_iris()
iris = iris_raw.data
iris[0:8,:]
sepal_len = iris[:,0] # gets first column, sepal length
sepal_wid = iris[:,1] # gets 2nd column, sepal width
petal_len = iris[:,2] # gets column 3, petal length

Statistical Functions

stats.describe(sepal_len)
species = iris_raw.target # integer-coded species labels
np.unique(species, return_counts = True)
stats.ttest_1samp(petal_len, 10) # 1 sample t test
stats.ttest_ind(sepal_len, petal_len) #2 sample t test
stats.f_oneway(sepal_len, sepal_wid, petal_len) # ANOVA

Distributions

my_binom = stats.binom(6, 0.5) # binomial distribution
my_binom.rvs(10) # random sampling
my_binom.pmf(3) # probability mass function: likelihood of exactly 3 successes
my_norm = stats.norm(3, 1) # normal distribution with mean = 3, stdev = 1
my_norm.cdf(4) # cumulative distribution function
my_norm.pdf(4) # probability density function

PANDAS

Series

import pandas as pd
obj = pd.Series(['a', 'b', 'c', 'd'])
obj
## 0    a
## 1    b
## 2    c
## 3    d
## dtype: object

Slicing

obj[0]
## 'a'
obj.index
## RangeIndex(start=0, stop=4, step=1)
obj.values
## array(['a', 'b', 'c', 'd'], dtype=object)

Convert Dictionary to series

dict_ = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
obj3 = pd.Series(dict_)
obj3
## 1    a
## 2    b
## 3    c
## 4    d
## dtype: object
obj3.to_dict() # and back to dictionary
## {1: 'a', 2: 'b', 3: 'c', 4: 'd'}

Dataframes

Creating a dataframe

# From Dictionary
import pandas as pd
data = {'commodity': ['Gold', 'Gold', 'Silver', 'Silver'],
        'year': [2013, 2014, 2014, 2015],
        'production_Moz': [107.6, 109.7, 868.3, 886.7]} 


df = pd.DataFrame(data)
df
##   commodity  year  production_Moz
## 0      Gold  2013           107.6
## 1      Gold  2014           109.7
## 2    Silver  2014           868.3
## 3    Silver  2015           886.7

can also create a dataframe with a nested list

df_2 = pd.DataFrame([[107.6, 'Gold', 2013],
                   [109.7, 'Gold', 2014],
                   [868.3, 'Silver', 2014],
                   [886.7, 'Silver', 2015]], 
                    columns = ['production_Moz','commodity','year'])
df_2
##    production_Moz commodity  year
## 0           107.6      Gold  2013
## 1           109.7      Gold  2014
## 2           868.3    Silver  2014
## 3           886.7    Silver  2015

DF Methods

df.values
df.index
df=df.set_index('commodity') # sets index to commodity column
df=df.reset_index() # has a drop argument, defaults to false

df.columns
df.index
df.columns.tolist() # creates a list of column names
# can apply the string submodule to use string methods
# df.columns.str.replace()

DF slicing

# With chained brackets (df[col][row]), the first index is the column, the second is the row
# how to get columns
df.commodity
## 0      Gold
## 1      Gold
## 2    Silver
## 3    Silver
## Name: commodity, dtype: object
df['commodity']
## 0      Gold
## 1      Gold
## 2    Silver
## 3    Silver
## Name: commodity, dtype: object
df.iloc[1]
## commodity          Gold
## year               2014
## production_Moz    109.7
## Name: 1, dtype: object
df.loc[:,'commodity'] #row : Col
## 0      Gold
## 1      Gold
## 2    Silver
## 3    Silver
## Name: commodity, dtype: object
df[['commodity']] # returns a df
##   commodity
## 0      Gold
## 1      Gold
## 2    Silver
## 3    Silver
df['commodity'][0] # returns comodity column row 0
## 'Gold'
df[['commodity', 'year']]
##   commodity  year
## 0      Gold  2013
## 1      Gold  2014
## 2    Silver  2014
## 3    Silver  2015
df.loc[df['commodity'] == 'Gold', :]
##   commodity  year  production_Moz
## 0      Gold  2013           107.6
## 1      Gold  2014           109.7
df.loc[:, df.loc[0] == 2013]
# df.loc[row, col]
# returns the columns whose value in row 0 equals 2013
# note: year is stored as an integer, so comparing to the string '2013' would match nothing
##    year
## 0  2013
## 1  2014
## 2  2014
## 3  2015

DF Functions

df.sort_values('value') # 'value' is the column to sort on
pd.concat([df1, df2]) # always combines on index
pd.merge() # df1, df2, how (left, inner), on (column to merge on)
# if cols don't have the same name, use 'left_on' and 'right_on' to indicate how to perform the merge
# df.rename({'oldname':'newname'}, axis = 1)

Data Merging

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                   columns = ['a', 'b', 'c'],
                   index = ['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                   columns = ['d','e'],
                   index = ['three', 'two','one'])
print(df1)
##        a  b  c
## one    0  1  2
## two    3  4  5
## three  6  7  8
print(df2)
##        d  e
## three  0  1
## two    2  3
## one    4  5

Since the two data frames have the same number of rows, it is natural to combine them “horizontally”.
Note the concatenation takes place on the name of the index and not the order.
Merges based on index

pd.concat([df1, df2], axis = 1, sort = False)
##        a  b  c  d  e
## one    0  1  2  4  5
## two    3  4  5  2  3
## three  6  7  8  0  1

hstack will merge without matching indices

np.hstack((df1.values, df2.values))
## array([[0, 1, 2, 0, 1],
##        [3, 4, 5, 2, 3],
##        [6, 7, 8, 4, 5]])

example: if indices don't match and you use concat, it populates with NaN values

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                   columns = ['a', 'b', 'c'],
                   index = ['One', 'two', 'three'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                   columns = ['d','e'],
                   index = ['three', 'two','one'])
print(df1,'\n\n',df2)
##        a  b  c
## One    0  1  2
## two    3  4  5
## three  6  7  8 
## 
##         d  e
## three  0  1
## two    2  3
## one    4  5
print(pd.concat([df1, df2], axis = 1, sort = False))
##          a    b    c    d    e
## One    0.0  1.0  2.0  NaN  NaN
## two    3.0  4.0  5.0  2.0  3.0
## three  6.0  7.0  8.0  0.0  1.0
## one    NaN  NaN  NaN  4.0  5.0

can use an inner join

pd.concat([df1, df2], axis = 1, join = 'inner')
##        a  b  c  d  e
## two    3  4  5  2  3
## three  6  7  8  0  1

Apply

apply applies a function on 1D arrays of columns and rows

df1.apply(lambda x: max(x), axis = 0) # 0 stands for apply to each column

will return max value for each column

If you just want to apply the function to a single column, you can extract that specific series first and then call the map() method, just like the map function in Python.

df1.a.map(lambda x: x+1)

additional techniques

how to split a column whose cells contain multiple pipe-separated values

# strings separated by |
df_genre.genres = df_genre.genres.apply(lambda x: x.split('|'))
# turns the genres column into a list of strings
index=0
# save [index, genre] in a nested list
list_ = []
for item in df_genre.genres:
    list_.extend(map(lambda x: [index, x], item))
    index += 1
genre = pd.DataFrame(list_, columns = ['index', 'genres'])
genre.head() 
#gives a dataframe 

Group

group = df.groupby('colname')
# groups are iterable
for item in group:
    print(item)
# make a groupby dictionary
group_dict = dict(iter(group))
group.agg(['count', 'mean', 'min', 'max', 'sum', 'std'])

Read from file

df=pd.read_csv('foo.csv')
df

DF Null values

df.isnull() # returns t/f dataframe
np.sum(df.isnull()) # sums nulls per column
## one      2
## two      2
## three    2
## four     1
## dtype: int64
np.sum(df.isnull(), axis=1) # sums nulls per row
df.dropna(axis=0, how='any') # drops rows with NA values
df.fillna(0) # fills NA values
df['one'].fillna(df['one'].mean()) # fills all NAs in 'one' column with the mean
df.interpolate() # fills NA values by linear interpolation of surrounding values
# create a boolean mask for missing values
# mask = df.isnull().any(axis = 1)
# df.loc[~mask, :] will return only the rows without NaNs; the tilde operator is negation

MatplotLib

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
x=[5,5,5,10,10,30,30,20,25,40]
plt.hist(x)
plt.show()
values=[1628701 ,2582830, 1432132, 476179 , 2278906]
labels=['Manhattan','Brooklyn','Bronx','Staten Island', 'Queens']
plt.pie(values,labels=labels)
plt.boxplot(x)
x=[7,8,9,11]
y=[9,4,6,7]
plt.bar(x,y);
plt.barh(x,y);
plt.scatter(x,y);

Outliers

to remove outliers more than n std deviations out
scatter_df = scatter_df.loc[scatter_df.apply(lambda x: np.abs(x - x.mean()) / x.std() < n).all(axis = 1)]

Histogram

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('https://s3.amazonaws.com/nycdsabt01/movie_metadata.csv')
plt.hist(df['imdb_score'], bins = 20, color = "#5ee3ff")
## (array([  5.,  11.,  12.,  26.,  51.,  51.,  95., 127., 217., 362., 497.,
##        661., 795., 720., 666., 419., 226.,  81.,  18.,   3.]), array([1.6  , 1.995, 2.39 , 2.785, 3.18 , 3.575, 3.97 , 4.365, 4.76 ,
##        5.155, 5.55 , 5.945, 6.34 , 6.735, 7.13 , 7.525, 7.92 , 8.315,
##        8.71 , 9.105, 9.5  ]), <a list of 20 Patch objects>)

Scatterplot

plt.scatter(df['budget'], df['gross'])
plt.xlabel('Budget')
plt.ylabel('Gross Income')

Barplot

barplot with a groupby

plt.figure(figsize = (12,6))
df.groupby('country')['imdb_score'].median().sort_values(ascending=False).plot.bar(color = 'b')

Plotting Functions

X=np.linspace(0,4*3.14,101)
Y=np.sin(X)
plt.plot(X,Y,'.')

Plotting Features

plt.figure(figsize=(12,6)) #figsize must be first declaration
plt.plot(x,y,'g.-',markersize=20, linewidth=4,alpha=0.4,label='data1')
plt.plot(x,y2,'.--',color='red',markersize=10,alpha=0.7,linewidth=0.3,label='data2')
plt.legend(loc=1)
plt.title('Example Plot')
plt.xlabel('horizontal axis')
plt.ylabel('vertical axis')
plt.xlim(6,12)
plt.ylim(0,10)
plt.xticks([6,9,12])
plt.yticks([0,4,8])
plt.text(10,2,'some text',color='blue',size=14)
plt.show()

Seaborn

import seaborn as sns
titanic=sns.load_dataset('titanic')
sns.countplot(x="deck", data=titanic);
plt.hist(titanic['deck'].dropna());
sns.stripplot(x="sex",y="age",data=titanic)
sns.swarmplot(x="sex",y="age",data=titanic)
sns.barplot(x="sex",y="survived", data=titanic)
sns.barplot(x="sex",y="survived", hue='class', data=titanic)

R

Markdown

Formatting

For basic formatting for your output HTML document
(how to update your YAML Preamble):

https://bookdown.org/yihui/rmarkdown/html-document.html

HTML

colors, fonts, etc… (most tags will not work if placed around code chunks, or will not work as expected)

https://daringfireball.net/projects/markdown/syntax

Printing

x = 2
# Cat automatically converts to string if needed
cat("x =", x, "\n")
## x = 2

Basic Mathematical functions

3^2
## [1] 9
# or
3**2
## [1] 9
#integer division
5%/%2
## [1] 2
#remainder
5%%2
## [1] 1

Converting Value types

as.integer(5.6)
## [1] 5
as.character(5.6)
## [1] "5.6"

Data Types

Vector

# creating a vector
# must be same data type
x = c(1,2,3,4)
cat(x)
## 1 2 3 4
cat(c(x,x))
## 1 2 3 4 1 2 3 4
# seq(start, end (inclusive), step)
seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10,2)
## [1] 1 3 5 7 9
# automatically calculate increment
seq(1,10, length=5)
## [1]  1.00  3.25  5.50  7.75 10.00
rnorm(10) # default mean 0 and standard deviation 1, normal distribution
##  [1]  0.8031474  0.6689984 -0.2303817 -0.8639633  0.6768409  1.2026651
##  [7]  0.2602346  0.2272572  0.3623046  1.2730424
runif(5) # default min 0 and max 1, uniform distribution
## [1] 0.5140753 0.6960861 0.6714345 0.2550422 0.4059086
sample(c('A','B'), size = 10, replace = TRUE) # samples from the vector; the replace argument sets whether elements can be sampled again
##  [1] "B" "A" "B" "A" "A" "B" "B" "B" "A" "A"
# repetition vector creation
rep(c(1, 2, 3), 5)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
# and so can the second argument (two vectors must have the same length)
rep(c(1, 2, 3), c(3, 5, 1))
## [1] 1 1 1 2 2 2 2 2 3
#You can obtain a subvector (a “slice”) of a vector in many ways. The simplest
#slice is a single element. We can get this by subscripting (indexes start at 1):
x = c(1, 2, 3, 4)
x[1] # use square brackets
## [1] 1
#You can get a slice by subscripting with a vector:
x[c(1, 3)]
## [1] 1 3
#You can subscript with the output of a function too:
x[seq(1, 4, by=2)]
## [1] 1 3
#You can also get a slice by subscripting with a logical vector:
x[c(T,F,T,T)]
## [1] 1 3 4
#You can extract all the elements except those specified by using
#negative indices:
x[c(-2, -4)] 
## [1] 1 3
v = 10:15
#We can modify an element of a vector using direct assignment.
v[1] <- 20
v
## [1] 20 11 12 13 14 15
#You can also modify a set of elements by using slicing on the left-hand side of
#the assignment.
v[2:3] <- c(21, 22)
v
## [1] 20 21 22 13 14 15

#We can add labels to the elements of a vector; these will be printed above the
#values. The names vector on the right-hand side should be of the same length as
#the vector on the left-hand side:
v = 10:15
names(v) = c('a','b','c','d','e','f')
v
##  a  b  c  d  e  f 
## 10 11 12 13 14 15
v['a'] # v[1] also works
##  a 
## 10

#This allows us to select slices using complex conditions:
x[x>=2 & x<4] # x = c(1, 2, 3, 4)
## [1] 2 3

Null Values

#Calculating statistics of data that contain NAs may return NA.
temp = c(27, 29, 23, 14, NA)
mean(temp)
## [1] NA
#We can set the na.rm parameter to TRUE to ignore the NAs.
mean(temp, na.rm=TRUE)
## [1] 23.25

Functions

cost_per_person <- function(people, bill) {
  tax = bill * 0.07
  tip = bill * 0.18
  total = bill + tax + tip
  costPP = total / people
  return(costPP)
}
cost_per_person(5, 221.78)
## [1] 55.445
cost_per_person(4, 221.78)
## [1] 69.30625

Default Arguments

xyz = function(x=1, y, z=1) {
  return(x * 100 + y * 10 + z)
}
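
A quick usage sketch: arguments with defaults can be omitted, but y has no default and must be supplied:

xyz(y = 5)        # x and z fall back to their defaults: 1*100 + 5*10 + 1
## [1] 151
xyz(2, 5)         # positional: x = 2, y = 5, z defaults to 1
## [1] 251
xyz(y = 5, z = 3) # 1*100 + 5*10 + 3
## [1] 153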

Hidden arguments

Extra arguments like na.rm=TRUE can be passed through a function: the ... in the function definition collects these hidden parameters and forwards them to inner function calls (here, to mean and sum).

SDcalc = function(x, type = 'sample', ...) {
  # ... forwards extra arguments (e.g. na.rm=TRUE) to mean() and sum()
  n = length(x)
  mu = mean(x, ...)
  if (type == 'sample') {
    stdev = sqrt(sum((x-mu)^2, ...)/(n-1))
  }
  if (type == 'population') {
    stdev = sqrt(sum((x-mu)^2, ...)/n)
  }
  return(stdev)
}
test = c(1:10, NA)
SDcalc(test, type='sample', na.rm=TRUE)
## [1] 2.872281

Built In Functions

set.seed(0) # For reproducible results
x = rnorm(1000, mean=2) # Random sample from a normal distribution

mean(x) # mean of x
## [1] 1.98417
median(x) # median of x
## [1] 1.941129
var(x) # variance of x
## [1] 0.9960137
sd(x) # standard deviation of x
## [1] 0.9980048
quantile(x, c(0.25, 0.50, 0.75)) # quantiles of x
##      25%      50%      75% 
## 1.291544 1.941129 2.687639
sum(x) # sum of x
## [1] 1984.17

Working Directory

getwd()
## [1] "C:/Users/jmeis/NYC_DSA/CheatSheet"
#setwd("C:/Users/cmmaf/Desktop/RDA-50/1/Lecture")
#setwd("C:/Users/cmmaf/Desktop/RDA-50")

Reading and writing Files

#full_path = "C:/Users/cmmaf/Desktop/RDA-50/1/Lecture/data/CPS1988.csv"
#CPS1988 = read.csv(full_path, header=TRUE)

#relative_path = "data/CPS1988.csv"
#CPS1988 = read.csv(relative_path, header=TRUE)
library(AER)
## Loading required package: car
## Warning: package 'car' was built under R version 4.1.2
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("CPS1988")

dim(CPS1988)
## [1] 28155     7
# A data frame can be exported to a .txt file or a .csv file with write.table(). We
# need to specify the data frame, the file name, and the field separator string. By
# default, the indexes of the data frame will be the first column of the table
# written into the file. If that is not desired, set the argument row.names = FALSE:
# write.table(citydf, file='my_data.csv', sep=',', row.names=F)
# write.csv() can be used to export data frames as well. This is just write.table()
# with the field separator set to comma:
# write.csv(citydf, file='my_data.csv', row.names=F)

Data Frames

Creating Data Frames

#To create a data frame, note that it basically consists of two vectors
#placed side by side. Let's define those vectors:
city = c('Seattle','Chicago','Boston','Houston')
temp = c(78, 74, 50, 104)
#The function data.frame( ) combines vectors into data frames as separate
#columns. They must be of the same length, but can be of different types.
citydf <- data.frame(city=city, temp=temp)
citydf
##      city temp
## 1 Seattle   78
## 2 Chicago   74
## 3  Boston   50
## 4 Houston  104

Useful Methods

head(CPS1988)
# get dimensions of DF
dim(CPS1988)
## [1] 28155     7
# return structure of dataframe
str(CPS1988, vec.len=5)
## 'data.frame':    28155 obs. of  7 variables:
##  $ wage      : num  355 123 370 755 594 377 ...
##  $ education : int  7 12 9 11 12 16 8 12 12 14 12 8 ...
##  $ experience: int  45 1 9 46 36 22 51 34 0 18 17 42 ...
##  $ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  $ smsa      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 2 2 ...
##  $ region    : Factor w/ 4 levels "northeast","midwest",..: 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  $ parttime  : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 1 1 ...
#get summary
summary(CPS1988)
##       wage            education       experience   ethnicity     smsa      
##  Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223  
##  1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932  
##  Median :  522.32   Median :12.00   Median :16.0                           
##  Mean   :  603.73   Mean   :13.07   Mean   :18.2                           
##  3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0                           
##  Max.   :18777.20   Max.   :18.00   Max.   :63.0                           
##        region     parttime   
##  northeast:6441   no :25631  
##  midwest  :6863   yes: 2524  
##  south    :8760              
##  west     :6091              
##                              
## 
head(CPS1988$region)
## [1] northeast northeast northeast northeast northeast northeast
## Levels: northeast midwest south west
#changing column names
colnames(citydf) = c('CITY','TEMP')

Slicing Data Frames

# DF[row,col]
# leave a selection blank to get all of it
str(CPS1988, vec.len=1)
## 'data.frame':    28155 obs. of  7 variables:
##  $ wage      : num  355 ...
##  $ education : int  7 12 ...
##  $ experience: int  45 1 ...
##  $ ethnicity : Factor w/ 2 levels "cauc","afam": 1 1 ...
##  $ smsa      : Factor w/ 2 levels "no","yes": 2 2 ...
##  $ region    : Factor w/ 4 levels "northeast","midwest",..: 1 1 ...
##  $ parttime  : Factor w/ 2 levels "no","yes": 1 2 ...
# drop = FALSE when slicing a data frame retains the DF structure:
# CPS1988[1:8, "wage", drop = FALSE]
# $ and [[ ]] return the column as a plain vector:
CPS1988$wage[1:8]
## [1] 354.94 123.46 370.37 754.94 593.54 377.23 284.90 561.13
CPS1988[["wage"]][1:8]
## [1] 354.94 123.46 370.37 754.94 593.54 377.23 284.90 561.13
CPS1988["wage"]
CPS1988[c("wage", "education")]
# Use $ to get individual columns out as vectors
CPS1988$wage[1:6]
## [1] 354.94 123.46 370.37 754.94 593.54 377.23
CPS1988$education[1:6]
## [1]  7 12  9 11 12 16
# We can use brackets as well; since data frames have 2 dimensions, the pattern is df[ROWS, COLUMNS]

# Returns a df, first 10 rows and columns 2 and 4
CPS1988[1:10, c(2,4)]
CPS1988[1:10, c('education', 'ethnicity')]
# All rows, but only the education and experience columns
CPS1988[c('education', 'experience')]
#CPS1988[ , c('education', 'experience')]
# If you get only one column, it will be a vector
class(CPS1988[, 1])
## [1] "numeric"
class(CPS1988[, 1:2])
## [1] "data.frame"
CPS1988["wage"]
CPS1988[1]

Lists

#As noted before, a list can contain any type of values. It is similar to a
#dictionary in Python, but is more flexible in its key-value pairs. This list
#contains a string, a character vector, and a number:
student <- list(Name="Peter",Classes = c("Bio101","Psych200","CS125"),GPA=2.5)
student
## $Name
## [1] "Peter"
##
## $Classes
## [1] "Bio101" "Psych200" "CS125"
##
## $GPA
## [1] 2.5
# Both lists and vectors are indexed types, but lists have more flexible structures.
# In the example below, we break down each address in the vector address into
# components like street, city, and state (see the strsplit function under Strings).
# Components from the same address need to be in the same element, and a list is
# the best fit:
address = c('561 Felosa Drive, Los Angeles, CA 90017',
            '493 Virginia Street, Chicago, IL 60629, USA',
            '3522 Monroe Street, Apt 2, Houston, TX 77030, US')
address_lst = strsplit(address, ', ')
address_lst
## [[1]]
## [1] "561 Felosa Drive" "Los Angeles" "CA 90017"
##
## [[2]]
## [1] "493 Virginia Street" "Chicago" "IL 60629"
## [4] "USA"
##
## [[3]]
## [1] "3522 Monroe Street" "Apt 2" "Houston"
## [4] "TX 77030" "US"

# We can access the elements in the list. However, we need to use double square
# brackets, because as we saw before, single square brackets are used for slicing
# into sub-lists.
address_lst[[3]] ; class(address_lst[[3]])
## [1] "3522 Monroe Street" "Apt 2" "Houston"
## [4] "TX 77030" "US"
## [1] "character"
# example objects: a vector, a matrix, and a data frame
myVec = 1:3
myMat = matrix(1:4, nrow = 2)
myDf = data.frame(a = 1:2)
myList = list(myVec, myMat, myDf)
myList
namedList = list(A=myVec, B=myMat, C=myDf)
namedList$D = pi

namedList['E'] = exp(1)

namedList

lapply

Since list elements are heterogeneous, it doesn’t make sense to apply operations elementwise; for example, we cannot apply an arithmetic operator to a list.

#Apply functions can be used for vectorized operations instead (we will talk
#more about apply functions below):
length(address_lst) # the number of sub-lists
## [1] 3
lapply(address_lst, length) # the number of elements in each sub-list
## [[1]]
## [1] 3
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 5

Apply Functions

#If you want to implement a function f on each element of a vector v, write:
sapply(v, f)
#For example, to take the square root of each element in a vector:
v = 1:4
sapply(v, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000
#We can also use sapply to implement a function to the columns of a data frame,
#or the elements of a list.
#Since a data frame is really a collection of vectors (its columns), we can use
#sapply( ). The result is a vector with named elements:
sapply(iris[ ,1:4], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
#(We extracted the first four columns simply because the fifth column - Species -
#is not numeric, so taking an average would not make sense.)
#Using the apply family here helps us to avoid writing this verbose code:
c(mean(iris$Sepal.Length), mean(iris$Sepal.Width),
mean(iris$Petal.Length), mean(iris$Petal.Width))
## [1] 5.843333 3.057333 3.758000 1.199333
#In addition to vectors and data frames, sapply can operate on lists. The result is
#a vector with named elements:
mylist = as.list(iris[ ,1:4])
# mylist
sapply(mylist, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

#The lapply() function is similar to sapply but it returns the values in list form
#instead.
lapply(mylist, mean)

Strings

# Count characters
fruit = 'apple orange grape banana'
nchar(fruit)
## [1] 25
# split string into a list
fruit_char = 'apple orange grape banana'
fruit_list = strsplit(fruit_char, split=' ')
fruit_list
## [[1]]
## [1] "apple"  "orange" "grape"  "banana"
#Creating a vector from the returned list:
fruit_vec = fruit_list[[1]]
fruit_vec
## [1] "apple"  "orange" "grape"  "banana"
## [1] "apple" "orange" "grape" "banana"
# Or, alternatively, we can use unlist:
fruit_vec = unlist(fruit_list)
fruit_vec
## [1] "apple"  "orange" "grape"  "banana"
## [1] "apple" "orange" "grape" "banana"

substr

The substr function returns substrings of a character/string object:
fruit = 'apple orange grape banana'

substr(fruit, 1, 5)
## [1] "apple"
## [1] "apple"
#Like many functions, substr extends element-wise to vectors (i.e., it's vectorized):
fruit_vec
## [1] "apple"  "orange" "grape"  "banana"
## [1] "apple" "orange" "grape" "banana"
substr(fruit_vec, 1, 3)
## [1] "app" "ora" "gra" "ban"
## [1] "app" "ora" "gra" "ban"

gsub

The gsub function replaces parts of a string:

fruit
## [1] "apple orange grape banana"
## [1] "apple orange grape banana"
gsub('apple','strawberry', fruit)
## [1] "strawberry orange grape banana"
## [1] "strawberry orange grape banana"
gsub('a','?',fruit)
## [1] "?pple or?nge gr?pe b?n?n?"
## [1] "?pple or?nge gr?pe b?n?n?"
gsub('an','HAA', fruit)
## [1] "apple orHAAge grape bHAAHAAa"
## [1] "apple orHAAge grape bHAAHAAa"

Extract the numerical dollar amounts from strings:

x_char = c('-$1,700.35','$2,400,800')
as.numeric(x_char)
## Warning: NAs introduced by coercion
## [1] NA NA
gsub('\\$|,','', x_char)
## [1] "-1700.35" "2400800"
## [1] "-1700.35" "2400800"
as.numeric(gsub('\\$|,','', x_char))
## [1]   -1700.35 2400800.00

grep

Pattern matching; find strings that contain a certain pattern in a vector:

fruit_vec
## [1] "apple"  "orange" "grape"  "banana"
## [1] "apple" "orange" "grape" "banana"
grep('grape', fruit_vec)
## [1] 3
grep('a', fruit_vec)
## [1] 1 2 3 4
grep('an', fruit_vec)
## [1] 2 4

Paste

Similar in spirit to Python's zip: it combines vector elements pairwise (but returns strings).

#The paste function concatenates vector elements pairwise with the usual recycling rules, separated by the sep argument (default sep=' ').
year = c(2018, 2019, 2020)
month = c(10, 11, 12)
day = c(23, 24, 25)
paste(year, month, day, sep='-')
## [1] "2018-10-23" "2019-11-24" "2020-12-25"
## [1] "2018-10-23" "2019-11-24" "2020-12-25"
#If the collapse argument is provided, the resulting vector is concatenated into one string, with the collapse argument
#as separator:
url_vec = c("http://emollusks.myspecies.info","tinytax","get","8")
paste(url_vec, collapse='/')
## [1] "http://emollusks.myspecies.info/tinytax/get/8"
## [1] "http://emollusks.myspecies.info/tinytax/get/8"

Factors

Factors represent categorical values, a usually small number of discrete values that might be used as discriminators in statistical tests

#factor(v) makes vector v a factor; each distinct value becomes a level:
ranks = c(4, 1, 1, 4, 3, 3, 2, 3, 2, 4)
rank_factor = factor(ranks)
rank_factor
##  [1] 4 1 1 4 3 3 2 3 2 4
## Levels: 1 2 3 4

levels(rank_factor)
## [1] "1" "2" "3" "4"
## [1] "1" "2" "3" "4"

#Levels can be renamed by assigning values directly to levels(factor).
vec1 = factor(rep(c(0,1,3), c(4,6,2)))
vec1
##  [1] 0 0 0 0 1 1 1 1 1 1 3 3
## Levels: 0 1 3
levels(vec1) = c("male","female","male")
vec1
##  [1] male   male   male   male   female female female female female female
## [11] male   male  
## Levels: male female

#We sometimes want to change the order of the levels (e.g., their order affects how they will appear in bar graphs). This can be done with the ordered function:  
vec2 = factor(rep(c('b','a'), c(4,6)))
vec2
##  [1] b b b b a a a a a a
## Levels: a b
levels(vec2)
## [1] "a" "b"
## [1] "a" "b"
ordered(vec2, levels=c('b','a'))
##  [1] b b b b a a a a a a
## Levels: b < a

#summary on character vector returns only the length of the vector:
x_char = c("a","b","a", NA)
summary(x_char)
##    Length     Class      Mode 
##         4 character character
#summary on factor returns level counts, including missing levels and NA:
x_fac = factor(x_char, levels = c("a","b","c"))
summary(x_fac)
##    a    b    c NA's 
##    2    1    0    1
x_fac
## [1] a    b    a    <NA>
## Levels: a b c

cut

#The cut function converts a numeric vector to a factor, by matching numeric intervals to corresponding factor levels.
#Example transforming age values to corresponding age categories:
ages = c(30, 50, 80, 20, 100, 70)
age_breaks = c(0, 30, 70, 999)
age_categories = c("young","middle","old")
age_groups = cut(ages, breaks=age_breaks, labels=age_categories)
age_groups
## [1] young  middle old    young  old    middle
## Levels: young middle old
class(ages)
## [1] "numeric"
## [1] "numeric"
class(age_groups)
## [1] "factor"
## [1] "factor"

Control Flow

IF

num = 5
# Note: %% is the Modulo (remainder) operation
if (num %% 2 != 0) {
  cat(num, 'is odd')
}
## 5 is odd
num = 5

if (num %% 2 != 0) {
  cat(num, 'is odd')
  # more commands
} else {
  cat(num, 'is even')
  # more commands
}
## 5 is odd
set.seed(0)
age = sample(0:100, 20, replace=TRUE)
age
##  [1] 13 67 38  0 33 86 42 13 81 58 50 96 84 20 53 73  6 72 78 84
res = ifelse(age > 70, 
             'old', 
             ifelse(age <= 30, 
                    'young', 
                    'mid'))
res
##  [1] "young" "mid"   "mid"   "young" "mid"   "old"   "mid"   "young" "old"  
## [10] "mid"   "mid"   "old"   "old"   "young" "mid"   "old"   "young" "old"  
## [19] "old"   "old"

FOR

library(readr)
sign_data = read.csv("TimesSquareSignage.csv", header=TRUE)
obs = nrow(sign_data)
for (i in 1:obs) {
  if (is.na(sign_data$Width[i])) {
    cat('WARNING: Missing width for sign no.', i, '\n')
  }
}

While

i = 1
while (i <= obs) {
  if (is.na(sign_data$Width[i])) {
    cat('WARNING: Missing width for sign no.', i, '\n')
  }
  i = i + 1
}

switch

If you have many conditions, you might want to consider the switch function, which comes in two versions. When the first parameter to switch is a string, the function is called with keyword arguments, and the first parameter is matched against one of the keyword arguments:

ages = c(30, 50, 80, 20, 100, 70)
age_group = "middle"
# Get the ages in that age group
switch(age_group,
"young" = ages[ages <= 30],
"middle" = ages[ages <= 70 & ages > 30],
"old" = ages[ages > 70]
)
## [1] 50 70

#If switch's first argument is an integer, it just returns the corresponding following argument:
week_day = 2
# Get corresponding week day string (abbreviation)
switch(week_day,
"Sun",
"Mon",
"Tue",
"Wed",
"Thu",
"Fri",
"Sat"
)
## [1] "Mon"
## [1] "Mon"

Dplyr

case_when

case_when can be a better alternative than the nested ifelse approach because the code is more succinct, which makes it easier to write and read.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ages = c(30, 50, 80, 20, 100, 70)
age_groups = case_when(ages <= 30 ~ 'young',
                       ages <= 70 ~ 'middle',
                       ages > 70 ~ 'old')
age_groups
## [1] "young"  "middle" "old"    "young"  "old"    "middle"
## [1] "young" "middle" "old" "young" "old" "middle"

Tibble

births_df = read.csv("C:/Users/jmeis/RDA_054/Lecture 2/data/births.csv", stringsAsFactors=FALSE)
bnames_df = read.csv("C:/Users/jmeis/RDA_054/Lecture 2/data/bnames.csv", stringsAsFactors=FALSE)
births = as_tibble(births_df)
bnames = as_tibble(bnames_df)

Select, Filter, Arrange

bnames %>%
# can use negative index to exclude a column
# select (-name)
  select( year, name, prop) %>%
  filter( name == 'Justin') %>%
  arrange(desc(prop)) %>%
  head(2)
#df %>% select(contains('e'), starts_with('c'))

Select functions

starts_with(x, ignore.case=T): pick column names starting with x, case insensitive when ignore.case is (by default) set true.
ends_with(x, ignore.case=T): pick column names ending with x.
contains(x, ignore.case=T): pick column names containing x.
matches(x, ignore.case=T): pick columns whose names match x, where x is a regular expression.
one_of(name_1, name_2, ..., name_n): pick columns that have any of the names in the list. (The argument can also be a character vector.)
Can rename columns within select: df %>% select(COLOR=color)

Mutate, Transmute, Summarise

bnames %>%
  mutate(., newCol = year *2) %>%
  transmute(newCol = newCol/2) %>%
  summarise(sum = sum(newCol/n()))

Joins

x = data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar","bass","guitar","drums","bass","drums"))
y = data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c(TRUE, TRUE, TRUE, TRUE, FALSE))
left_join(x, y, by = "name")   # all rows of x, plus matching columns of y
right_join(x, y, by = "name")  # all rows of y
inner_join(x, y, by = "name")  # only rows with a match in both
full_join(x, y, by = "name")   # all rows of both
semi_join(x, y, by = "name")   # rows of x with a match in y (keeps only x's columns)
anti_join(x, y, by = "name")   # rows of x without a match in y

Group_by

totals = bnames %>%
  group_by(name) %>%
  summarise(total = n())

head(totals)

Tidyr

What do we consider a tidy dataset?
Your data will be easier to work with in R if it follows three rules:
+ Each variable in the dataset is placed in its own column
+ Each observation is placed in its own row
+ Each value is placed in its own cell

readr::read_csv(filepath) # creates a tibble
readr::parse_number(<vector>) will extract numbers from string
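
For example, a small sketch with readr's parse_number, which drops currency symbols, grouping marks, and non-numeric prefixes:

library(readr)
parse_number(c('$1,700.35', 'wk12'))
## [1] 1700.35   12.00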

pivot

pivot_longer() and pivot_wider() were previously called gather and spread

pivot_longer(data, cols, names_to = 'key', values_to = 'value')
pivot_wider(data, names_from = 'key', values_from = 'value')

example:

billboard2 <- billboard %>%
  pivot_longer(c(wk1:wk76), names_to = "week", values_to = "rank",
               values_drop_na = TRUE)
head(billboard2)
weather3 <- weather2 %>%
  pivot_wider(names_from = element, values_from = value)

separate and unite

separate will split the values in a cell into separate columns

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = 'warn', fill = 'warn', ...)

unite will take values from multiple columns and combine them into one column, separated by a user-supplied separator

unite(data, col, ..., sep = '_', remove = TRUE)
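
A minimal sketch of the round trip (the date column here is illustrative):

library(tidyr)
df = data.frame(date = c('2020-10-23', '2021-11-24'))
df2 = separate(df, date, into = c('year', 'month', 'day'), sep = '-')
df2
##   year month day
## 1 2020    10  23
## 2 2021    11  24
unite(df2, 'date', year, month, day, sep = '-')
##         date
## 1 2020-10-23
## 2 2021-11-24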

ggplot2

g <- ggplot(data = <data frame>, aes(x = <default x variable>, y = <default y variable>, ...))

# Basic scatter plot 
library(ggplot2)
g <- ggplot(data = mpg, aes(x = displ, y = hwy))
g + geom_point(aes(color = class))

# in the aes can use size or shape or alpha for additional variables

Regression line

ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Layering

g + geom_point(aes(color = class)) +
geom_point(aes(x=mean(displ), y=mean(hwy)), size=8)

### Faceting

g + geom_point() + facet_grid(. ~ cyl)

# or
g + geom_point() + facet_grid(cols = vars(cyl))

g + geom_point() + facet_grid(drv ~ .)

# or
g + geom_point() + facet_grid(rows = vars(drv))

g + geom_point() + facet_grid(drv ~ cyl)

# or
g + geom_point() + facet_grid(cols = vars(cyl), rows = vars(drv))

g + geom_point() + facet_wrap( ~ class)

# or
g + geom_point() + facet_wrap(vars(class))

Boxplot

g <- ggplot(data = mpg, aes(x = class, y = hwy))
g + geom_boxplot()

### forcats library to reorder your variables

library(forcats)
ggplot(data = mpg, aes(x = fct_reorder(class, hwy), y = hwy)) + geom_boxplot()

### Barplots

ggplot(data = diamonds,aes(x = cut)) + geom_bar(aes(fill = cut))

### Histograms

g <- ggplot(data = diamonds, aes(x = carat))
g + geom_histogram(binwidth = 1)

can zoom in on a range

g <- ggplot(data = diamonds, aes(x = depth))
zoom <- coord_cartesian(xlim = c(55, 70))
g + geom_histogram(binwidth = 0.2) + zoom

putting a line over the histogram bars

ggplot(data = diamonds, aes(x = price)) +
facet_wrap(~ cut)+
geom_freqpoly(aes(color = cut),binwidth = 500)

or putting them on the same chart

ggplot(data = diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut),binwidth = 500)

### Saving plots

##ggsave("plotname.pdf")
##ggsave("plotname.png", width = 6, height = 6)
## can use height and width argument in both

Statistics

One-Sample T-Test

Used to determine if the sample mean is different from a hypothesized population mean.
Assumptions:
1. The population from which the sample is drawn is normally distributed.
2. Sample observations are randomly drawn and independent.

Example: Did the average NYC income change during COVID?

Example:
You are told that the average height of a person in your building (\(\mu\)) is 68 inches; however, you think the average person is actually much taller.
For this scenario:
+ Null Hypothesis \((H_{0}): \mu = 68\) inches
+ Alternative Hypothesis \((H_{A}): \mu > 68\) inches
Upon collecting a random sample of independent height measurements of people in your building, you can calculate the t-statistic.
Suppose your data is as follows: you collected 100 measurements (\(n = 100\)); the average height, in inches, of your sample is 70 (\(\bar{x} = 70\)); the standard deviation of your sample is 1 (\(s = 1\)).
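
With these assumed numbers, the t-statistic would be:
\[ t^* = \frac{\bar{x} - \mu_{0}}{s/\sqrt{n}} = \frac{70 - 68}{1/\sqrt{100}} = 20 \]
(The simulated sample below gives a slightly different value, since its observed mean and standard deviation are not exactly 70 and 1.)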

set.seed(0)
heights = rnorm(n = 100, mean = 70, sd = 1) #Randomly generating 100 normally
                                            #distributed observations with a
                                            #mean of 70 and standard deviation
                                            #of 1.

plot(density(heights), main = "Sample Distribution of Heights")
abline(v = 70, lwd = 2, lty = 2)
abline(v = 68, lwd = 2, lty = 2, col = "red")
legend("topright", c("True Mean = 70", "H0 Mean = 68"), lwd = 2,
       lty = 2, col = c("black", "red"))

boxplot(heights, main = "Sample Distribution of Heights")
abline(h = 70, lwd = 2, lty = 2)
abline(h = 68, lwd = 2, lty = 2, col = "red")
legend("topright", c("True Mean = 70", "H0 Mean = 68"), lwd = 2,
       lty = 2, col = c("black", "red"))

Manually calculating T-statistic

# Manually calculating the t-statistic, comparing to 68.
t.statistic = (mean(heights) - 68)/(sd(heights)/sqrt(100))
t.statistic
## [1] 22.91586

Manually calculating p-value

# P-value is extremely small; reject the null hypothesis in favor of
# the alternative.
pt(q = t.statistic, df = 99, lower.tail = FALSE)
## [1] 1.138406e-41

using the t.test function to perform a one-sided t-test

t.test(heights, mu = 68, alternative = "greater") #Same test, using the t.test() function.
## 
##  One Sample t-test
## 
## data:  heights
## t = 22.916, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 68
## 95 percent confidence interval:
##  69.87611      Inf
## sample estimates:
## mean of x 
##  70.02267

#The p-value for this test is extremely small (<.0005); thus, we reject the null hypothesis
#that the average height of individuals is 68 inches in favor of the alternative
#that the average height is greater than 68 inches at the 95% confidence level.

Two-Sample T-Test

  • T-test can also be used to examine the average difference between two samples drawn from two different populations.
  • Assumptions:
  • The populations from which the samples are drawn are normally distributed.
  • Sample observations are randomly drawn and independent.
  • The standard deviations of the two populations are equal. This one can be relaxed with a proper adjustment.

Example: Although the difficulty of the SAT should not vary across its different administrations, you believe that timing is everything; you suppose that there is a difference in the average score from tests taken in spring and fall.

  • \(H_{0} : \mu_{spring} = \mu_{fall}\)
  • \(H_{A} : \mu_{spring}\ne \mu_{fall}\)

Suppose your data is as follows:
+ You collected 180 scores \((n_{spring} = 100, n_{fall} = 80)\).
+ The average score for spring was 1550 and for fall was 1500.
+ The standard deviation of your spring sample is 200; for fall, 210.
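
The (Welch) t-statistic computed manually below is:
\[ t^* = \frac{\bar{x}_{spring} - \bar{x}_{fall}}{\sqrt{s^2_{spring}/n_{spring} + s^2_{fall}/n_{fall}}} \]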

set.seed(0)
SAT.Spring = rnorm(100, 1550, 200) #Randomly generating 100 normally distributed
                                   #observations with a mean of 1550 and a
                                   #standard deviation of 200.
SAT.Fall = rnorm(80, 1500, 210) #Randomly generating 80 normally distributed
                                #observations with a mean of 1500 and a standard
                                #deviation of 210.

plot(density(SAT.Spring), xlab = "SAT Score",
     main = "Sample Distribution of SAT Scores", col = "red")
lines(density(SAT.Fall), col = "blue")
legend("topright", c("Spring", "Fall"), lwd = 1, col = c("red", "blue"))

boxplot(SAT.Spring, SAT.Fall, main = "Sample Distribution of SAT Scores",
        col = c("red", "blue"), names = c("Spring", "Fall"))

Manually calculating the t-statistic.

t.statistic = (mean(SAT.Spring) - mean(SAT.Fall))/sqrt(var(SAT.Spring)/100 + var(SAT.Fall)/80)
t.statistic
## [1] 2.537598

Conduct t-test

t.test(SAT.Spring, SAT.Fall, alternative = "two.sided") 
## 
##  Welch Two Sample t-test
## 
## data:  SAT.Spring and SAT.Fall
## t = 2.5376, df = 154.92, p-value = 0.01215
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   16.44063 131.97461
## sample estimates:
## mean of x mean of y 
##  1554.534  1480.326
#The p-value is 0.01215, which is less than our threshold of 0.05; thus, we reject
#the null hypothesis that the average score on the spring SAT is the same as the
#fall SAT. We conclude in favor of the alternative that the average scores of the
#two test administrations differ at the 95% confidence level.

F-Test

F-Test is applied to assess whether the variances of two different populations are equal.
Assumptions:
+ The populations from which the samples are drawn are normally distributed.
+ Sample observations are randomly drawn and independent.

Example:
+ When we tested the difficulty of SAT exams in the previous example using the Two Sample T-Test, we should have had equal variances; however, the sample variances were slightly different. Were they significantly different?
+ For this scenario:
- Null Hypothesis \(H_{0} : \sigma^2_{spring} = \sigma^2_{fall}\)
- Alternative Hypothesis \(H_{A} : \sigma^2_{spring} \ne \sigma^2_{fall}\)
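
The test statistic is the ratio of the sample variances:
\[ F^* = \frac{s^2_{fall}}{s^2_{spring}} \sim F_{n_{fall}-1,\; n_{spring}-1} \]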

f.statistic = var(SAT.Fall)/var(SAT.Spring) #Manually calculating the F-statistic.
f.statistic
## [1] 1.395352
var.test(SAT.Fall, SAT.Spring, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  SAT.Fall and SAT.Spring
## F = 1.3954, num df = 79, denom df = 99, p-value = 0.1161
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9206019 2.1363341
## sample estimates:
## ratio of variances 
##           1.395352

One-Way ANOVA

We use One-Way ANOVA to assess the equality of means of two or more groups. NB: when there are exactly two groups, this is equivalent to a Two Sample T-Test. Notice that ANOVA stands for analysis of variance, but the purpose is to compare the means: whether the difference in means is significant is decided by examining the variances.
Assumptions:
+ The populations from which the samples are drawn are normally dist.
+ The standard deviations of the populations are equal.
+ Sample observations are randomly drawn and independent.

  • Calculate the F statistic (MS = Mean Square): \(F^* = \frac{MS_{Between}}{MS_{Within}} \sim F_{k-1,\,N-k}\), where k is the number of groups and N is the total number of observations.

Example:
You desire to test the efficacy of different types of diets on weight loss: low calorie, low carbohydrate, and low fat. You also have a control group for comparison purposes.
For this scenario:
+ Null Hypothesis \(H_{0} : \mu_{LowCalorie} = \mu_{LowCarb} = \mu_{lowFat} = \mu_{Control}\)
+ Alternative Hypothesis \(H_{A} :\) At least one of the average amounts of weight loss differs from the others.

Randomly generate weight loss measurements for various diet types

set.seed(0)
Low.Calorie = rnorm(200, 10, 1) #Randomly generating weight loss measurements
Low.Carb = rnorm(200, 8.5, 1)   #for various diet types.
Low.Fat = rnorm(200, 8, 1)
Control = rnorm(200, 0, 1)

Weight.Loss = c(Low.Calorie, Low.Carb, Low.Fat, Control) #Combining data into
Category = c(rep("Low Calorie", 200),                    #different consolidated
             rep("Low Carb", 200),                       #vectors.
             rep("Low Fat", 200),
             rep("Control", 200))

boxplot(Weight.Loss ~ Category,
        col = c("red", "orange", "yellow", "green"),
        main = "Distribution of Weight Loss\nfor Various Diets")

Calculate ANOVA

summary(aov(Weight.Loss ~ Category))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Category      3  12061    4020    4202 <2e-16 ***
## Residuals   796    762       1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test of Independence

The \(\chi^2\) Test of Independence is used for testing categorical variables.
Example:
+ A review session was held before a quiz was administered to a class of students. You would like to determine whether a student’s grade on the quiz is dependent on whether or not they attended the review session.
For this scenario:
- Null Hypothesis \((H_{0})\): The two variables are independent of one another.
- Alternative Hypothesis \((H_{A})\): The two variables are dependent on each other.
+ Upon collecting independent samples on whether or not students went to the review session and whether or not they passed the quiz, you can calculate the \(\chi^2\)-statistic.
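
The statistic compares the observed counts \(O_{ij}\) in each cell of the contingency table to the counts \(E_{ij}\) expected under independence:
\[ \chi^2 = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row}_i \text{ total})(\text{col}_j \text{ total})}{n} \]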

quiz.data = matrix(c(44, 21, 12, 18), nrow = 2, ncol = 2, byrow = TRUE)
dimnames(quiz.data) = list(Attendance = c("Present", "Absent"),
                           Grade = c("Pass", "Fail"))

chisq.test(quiz.data)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  quiz.data
## X-squared = 5.4106, df = 1, p-value = 0.02001

Simple Linear Regression

Formulas

For a linear relationship between X and Y:
\[ Y \approx \beta_{0} + \beta_{1}X \]
\(\beta_{1}\) is the slope, the strength of the relationship between X and Y.
The prediction for y based on the value of X:
\[ \hat{y_{i}} = \hat{\beta_{0}} + \hat{\beta_{1}}x_{i} \]
The difference between the ith observed response and the predicted response is the residual (or error) \(e_{i}\):
\[ e_{i} = y_{i} - \hat{y_{i}} \]
RSS, the residual sum of squares:
\[ RSS = \sum_{i = 1}^{n}(y_{i} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{i})^2 \]
Standard errors:

\[ \hat{SE}(\hat{\beta_{0}})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right] \]
where:
+ \(\sigma\) is the standard deviation of the residuals
+ \(n\) is the size of the sample set

Residual Standard Error \(\sigma\) can be estimated by: \[ \hat{\sigma} = RSE = \sqrt{\frac{RSS}{n-2}} \]

Hypothesis Tests: for SLR the principal hypothesis test is:
+ Null Hypothesis \((H_{0}): \beta_{1} = 0\)
+ Alternative Hypothesis \((H_{A}): \beta_{1} \ne 0\)

We can calculate the t statistic as follows:
\[ t^* = \frac{\hat{\beta_{1}} - 0}{SE(\hat{\beta_{1}})} \sim t_{n-2} \]
As we have previously seen, this yields a t-value that we use to calculate the area under the associated theoretical distribution, in order to assess the probability (p-value) of observing results at least as extreme as our own.
Should the p-value be less than a certain threshold (< 0.05), we reject the null hypothesis in favor of the alternative.

The F-Test:
\[ F^* = \frac{TSS - RSS}{RSS/(n-2)} \sim F_{1,\,n-2} \]
where the total sum of squares TSS (in other words, the error from a fit to a constant model) is defined:
\[ TSS = \sum_{i=1}^{n}(y_{i} - \bar{y})^2 \]

Assumptions of Simple Linear Regression:
+ Linearity
+ Constant Variance
+ Normality
+ Independent Errors

Box-Cox Transformation

Given a value of \(\lambda\), the Box-Cox transformation is defined: \[ y' = \begin{cases} \frac{y^{\lambda}-1}{\lambda} & \lambda \ne 0 \\ \log(y) & \lambda = 0 \end{cases} \]

The Box-Cox procedure iterates along values of \(\lambda\) by maximum likelihood so as to maximize the fit of the transformed variable to a normal distribution. The Box-Cox transformation does not guarantee that the transformed data will be normally distributed. Why?
+ The method is actually attempting to minimize the standard deviation with respect to the choice of λ.
+ We assume that when the standard deviation is smallest, the transformed data has the highest likelihood of being normally distributed.
The Box-Cox transformation can only be used on positive values. What if we have negative values in our data?
+ Shift all values in your dataset up by a constant that renders all values positive,then apply the Box-Cox transformation procedure.

Checking the fit of the model:
+ Smaller RSS and RSE indicate better fit

Coefficient of Determination: \[ R^2 = \frac{TSS - RSS}{TSS} = 1-\frac{RSS}{TSS} \] A higher \(R^2\) indicates a better fit; it is always bounded between 0 and 1.

Example with the cars dataset

use the lm() function to conduct simple linear regression

model = lm(dist ~ speed, data = cars)
model
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

#-The five number summary of the residuals.
#-The coefficient estimates.
#-The coefficient standard errors.
#-The t-test for significance of the coefficient estimates.
#-The p-values for the significance tests.
#-The level of significance.
#-The RSE and degrees of freedom for the model.
#-The coefficient of determination, R^2.
#-The overall model F-statistic and corresponding p-value.

t.statistic = 9.464
f.statistic = 89.57
t.statistic^2
## [1] 89.5673
confint(model)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

Checking assumptions with the cars dataset

Linearity

plot(cars, xlab = "Speed in MPH", ylab = "Distance in Feet",
     main = "Scatterplot of Cars Dataset")
abline(model, lty = 2)

Constant Variance & Independent Errors

plot(model$fitted, model$residuals,
     xlab = "Fitted Values", ylab = "Residual Values",
     main = "Residual Plot for Cars Dataset")
abline(h = 0, lty = 2)

Normality

qqnorm(model$residuals)
qqline(model$residuals)

plot(model)

library(car) #Companion to applied regression.
influencePlot(model)

Predicting New Observations

model$fitted.values #Returns the fitted values.
##         1         2         3         4         5         6         7         8 
## -1.849460 -1.849460  9.947766  9.947766 13.880175 17.812584 21.744993 21.744993 
##         9        10        11        12        13        14        15        16 
## 21.744993 25.677401 25.677401 29.609810 29.609810 29.609810 29.609810 33.542219 
##        17        18        19        20        21        22        23        24 
## 33.542219 33.542219 33.542219 37.474628 37.474628 37.474628 37.474628 41.407036 
##        25        26        27        28        29        30        31        32 
## 41.407036 41.407036 45.339445 45.339445 49.271854 49.271854 49.271854 53.204263 
##        33        34        35        36        37        38        39        40 
## 53.204263 53.204263 53.204263 57.136672 57.136672 57.136672 61.069080 61.069080 
##        41        42        43        44        45        46        47        48 
## 61.069080 61.069080 61.069080 68.933898 72.866307 76.798715 76.798715 76.798715 
##        49        50 
## 76.798715 80.731124
newdata = data.frame(speed = c(15, 20, 25)) #Creating a new data frame to pass
                                            #to the predict() function.

# Construct confidence intervals for the average value of an outcome
# at a specific point.
predict(model, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 41.40704 37.02115 45.79292
## 2 61.06908 55.24729 66.89088
## 3 80.73112 71.59608 89.86617

#Constructing confidence and prediction bands for the scope of our data.

newdata = data.frame(speed = 4:25)
conf.band = predict(model, newdata, interval = "confidence")
pred.band = predict(model, newdata, interval = "prediction")

Visualizing the confidence and prediction bands.

plot(cars, xlab = "Speed in MPH", ylab = "Distance in Feet",
     main = "Scatterplot of Cars Dataset")
abline(model, lty = 2) #Plotting the regression line.
lines(newdata$speed, conf.band[, 2], col = "blue") #Plotting the lower confidence band.
lines(newdata$speed, conf.band[, 3], col = "blue") #Plotting the upper confidence band.
lines(newdata$speed, pred.band[, 2], col = "red") #Plotting the lower prediction band.
lines(newdata$speed, pred.band[, 3], col = "red") #Plotting the upper prediction band.
legend("topleft", c("Regression Line", "Conf. Band", "Pred. Band"),
       lty = c(2, 1, 1), col = c("black", "blue", "red"))

The Box-Cox Transformation

library(car)
# boxCox automatically plots a 95% confidence interval for the lambda
# value that maximizes the likelihood of transforming to normality.
bc = boxCox(model)
lambda = bc$x[which(bc$y == max(bc$y))] #Extracting the best lambda value.

dist.bc = (cars$dist^lambda - 1)/lambda #Applying the Box-Cox transformation.

model.bc = lm(dist.bc ~ cars$speed) #Creating a new regression based on the
                                    #transformed variable.
summary(model.bc) #Assessing the output of the new model.
## 
## Call:
## lm(formula = dist.bc ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0926 -1.0444 -0.3055  0.7999  4.7520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08227    0.73856   1.465    0.149    
## cars$speed   0.49541    0.04541  10.910 1.35e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 48 degrees of freedom
## Multiple R-squared:  0.7126, Adjusted R-squared:  0.7066 
## F-statistic:   119 on 1 and 48 DF,  p-value: 1.354e-14
plot(model.bc) #Assessing the assumptions of the new model.

Multiple Linear Regression

\(Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 +...+ \beta_pX_p\)
In order to regress \(Y\) onto the \(X_i\), we need to estimate \((p+1)\) coefficients.

variables with a hat on them are estimated values

The difference between the observed response and the predicted value is the residual
or error:
\(e_i = y_i - \hat{y_i}\)

RSS is the residual sum of squares, which is the sum of the square of all residuals
\(\sum_{i =1}^{n}e_i^2\)

can use alternate notation:
denote X as the n by (p+1) matrix with each row as an observation in our dataset; note that the first column will, by default, be a vector of 1’s as a placeholder for the \(\beta_0\) intercept term.
Denote y as the n by 1 vector with each entry representing the response values in our dataset
can rewrite RSS as:
\(RSS(\beta) = (y-X\beta)^T(y-X\beta)\)
this is a quadratic function with (p+1) parameters

The point of regression is to find the \(\beta\) that minimizes RSS, so we can set the derivative to zero and solve for \(\beta\):

\(\hat{\beta} = (X^TX)^{-1}X^Ty\)
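
A small sketch verifying this closed form against lm(), using the built-in cars data from the SLR example (the column of 1's for the intercept is added by hand):

X = cbind(1, cars$speed)                 # n x (p+1) design matrix; first column is the intercept
y = cars$dist
beta = solve(t(X) %*% X) %*% t(X) %*% y  # (X^T X)^{-1} X^T y
beta                                     # approximately -17.579 and 3.932
coef(lm(dist ~ speed, data = cars))      # lm() produces the same estimates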

Assumptions of MLR

  • Linearity
  • Constant Variance - standard deviation of epsilon will be constant
  • Normality - Noise is normally distributed (epsilon is noise in each observation)
  • Independent errors

Multicollinearity

  • The no-multicollinearity assumption describes the relationships among the independent variables in our model. To satisfy this assumption, we need to verify that our independent variables are uncorrelated with each other. If they are not, our coefficient estimates could be unstable.
  • Can check by bivariate analysis (graph the independent variables against each other)
  • After fitting the MLR, inspect the variance inflation factor (VIF) for each predictor

Why collinearity is bad:
+ creates redundancies
+ difficult to isolate the influence on Y of two correlated predictors
+ standard errors of the estimates will be inflated
+ reliability of the regression coefficients will decrease

Variance Inflation Factor

For a 2-predictor model:
\(VIF_{x_1,x_2} = \frac{1}{1-\rho_{1,2}^2}\)
where \(\rho_{1,2}\) is the correlation between the two predictors. To put it another way:
if we have 3 variables \(x_1, x_2, x_3\) that are linearly related to each other, we can fit a linear model regressing \(x_1\) on \(x_2\) and \(x_3\), find the \(R^2\) value (the coefficient of determination) for the influence of \(x_2\) and \(x_3\) on \(x_1\), and then
\(VIF = \frac{1}{1-R^2}\)
Predictors that have a VIF of 5 or greater need to be dealt with and possibly dropped from the model.

The Overall F-Test

Used to determine if any variable provides us with information.
+ Null Hypothesis \(H_0: \beta_1 = \beta_2 = ... = \beta_p = 0\)
+ Alternative Hypothesis \(H_A:\) At least one \(\beta\) is not equal to zero

\(F^* = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1}\)
TSS is the RSS of the null model (which is a horizontal line at the average of y)
p is the number of predictors

  • If there is no relationship between response and predictors expect F* values close to 1
  • If there is a relationship we should expect to see F* values greater than 1

Partial F-Test

We can ask whether a subset of the variables provides information.

Step-wise Feature Selection

If we have 100 variables that truly have no association with the response variable, repeatedly performing an F-test on subsets of the variables will produce a type I error 5% of the time, leading to false correlations.

Akaike Information Criterion
\(AIC = -2\ln(L) + 2p\)
L is the likelihood
p is the number of estimated parameters
Smaller AIC indicates a better fit

Bayesian Information Criterion
\(BIC = -2\ln(L) + p\ln(n)\)
L is the likelihood
p is the number of estimated parameters
n is the number of observations

BIC penalizes model complexity more heavily

Choose AIC if the primary goal of MLR is prediction. Choose BIC if the goal of MLR is description.

Procedures:
+ If we have p parameters then there are \(2^p\) models we could create.
+ Using:
  - Forward Selection
  - Backward Selection
  - Both Selection

Forward Selection
+ Begin with a model that only has an intercept term, then sequentially add the predictor that most improves the fit (based on AIC, BIC, etc.).
Backward Selection
+ Begin with a model that includes all parameter terms, then sequentially remove the predictor that has the least impact on the fit.
Both Selection
+ Begin with either a model that only has an intercept term or a model that includes all parameter terms, then sequentially add or remove the predictor that has the most/least impact on the model, respectively.

Evaluation Prediction and Testing

Residual Standard Error: the standard deviation of the residuals about the regression surface: \(\hat{\sigma} = RSE = \sqrt{\frac{RSS}{n-p-1}}\)
Coefficient of Determination: the proportion of total sample variability in the dependent variable that is explained by the regression model:
\(R^2 = \frac{TSS-RSS}{TSS} = 1-\frac{RSS}{TSS}\)
+ For MLR, each time a predictor is added to the model \(R^2\) increases, even if it is just modeling noise; this produces an artificially high \(R^2\)
+ \(R^2\) does not take into account model complexity, which could result in overfitting
Adjusted \(R^2\): \(R_{adj}^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}\)
+ It can be shown that adding a predictor variable to a model increases adjusted \(R^2\) only if the corresponding F-statistic exceeds 1

Prediction
Once we have the \(\beta\)’s calculated we simply plug back into the linear equation to get \(\hat{y}\)

Categorical Features
Say we have a variable that takes one of three values: NY, FL, or CA.
Use dummification, or One Hot Encoding:
create a new column for each value
\[ X_{NY} = \begin{cases} 0 & i^{th} obs. \ne NY\\ 1 & i^{th} obs. = NY \end{cases} \quad X_{FL} = \begin{cases} 0 & i^{th} obs. \ne FL\\ 1 & i^{th} obs. = FL \end{cases} \quad X_{CA} = \begin{cases} 0 & i^{th} obs. \ne CA\\ 1 & i^{th} obs. = CA \end{cases} \]
We actually only need (number of levels - 1) columns, so in our example we only need two additional columns. One of the dummy columns must be removed, or \(X^TX\) is singular and cannot be inverted.

Example

Suppose we want to predict the Price of gas using a simple model with just the State dummy variables; we will choose NY to be our baseline:
\(Price = \beta_0 + \beta_{FL}X_{FL} + \beta_{CA}X_{CA}\)
In this model (see the sketch below):
+ \(\beta_0\) is the average gas price in NY
+ \(\beta_{FL}\) is the average gas price difference between Florida and NY
+ \(\beta_{CA}\) is the average gas price difference between California and NY
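
A minimal sketch of how R builds these dummy columns with model.matrix(); the State vector here is illustrative, and relevel() makes NY the baseline:

State = factor(c('NY', 'FL', 'CA', 'NY'))
State = relevel(State, ref = 'NY')  # NY becomes the baseline level
model.matrix(~ State)               # an intercept plus the X_CA and X_FL dummies
##   (Intercept) StateCA StateFL
## 1           1       0       0
## 2           1       0       1
## 3           1       1       0
## 4           1       0       0

(model.matrix() also prints contrast attributes, omitted here.)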

Code Example

Develop a predictive model for life expectancy

states = as.data.frame(state.x77)
colnames(states)[4] = "Life.Exp"
colnames(states)[6] = "HS.Grad"

Creating a population density variable.

states[,9] = (states$Population*1000)/states$Area
colnames(states)[9] = "Density"

Numerical EDA for dataset

summary(states)
##    Population        Income       Illiteracy       Life.Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS.Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  
##     Density        
##  Min.   :  0.6444  
##  1st Qu.: 25.3352  
##  Median : 73.0154  
##  Mean   :149.2245  
##  3rd Qu.:144.2828  
##  Max.   :975.0033
sapply(states, sd)
##   Population       Income   Illiteracy     Life.Exp       Murder      HS.Grad 
## 4.464491e+03 6.144699e+02 6.095331e-01 1.342394e+00 3.691540e+00 8.076998e+00 
##        Frost         Area      Density 
## 5.198085e+01 8.532730e+04 2.210063e+02
cor(states)
##             Population     Income   Illiteracy    Life.Exp     Murder
## Population  1.00000000  0.2082276  0.107622373 -0.06805195  0.3436428
## Income      0.20822756  1.0000000 -0.437075186  0.34025534 -0.2300776
## Illiteracy  0.10762237 -0.4370752  1.000000000 -0.58847793  0.7029752
## Life.Exp   -0.06805195  0.3402553 -0.588477926  1.00000000 -0.7808458
## Murder      0.34364275 -0.2300776  0.702975199 -0.78084575  1.0000000
## HS.Grad    -0.09848975  0.6199323 -0.657188609  0.58221620 -0.4879710
## Frost      -0.33215245  0.2262822 -0.671946968  0.26206801 -0.5388834
## Area        0.02254384  0.3633154  0.077261132 -0.10733194  0.2283902
## Density     0.24622789  0.3299683  0.009274348  0.09106176 -0.1850352
##                HS.Grad        Frost        Area      Density
## Population -0.09848975 -0.332152454  0.02254384  0.246227888
## Income      0.61993232  0.226282179  0.36331544  0.329968277
## Illiteracy -0.65718861 -0.671946968  0.07726113  0.009274348
## Life.Exp    0.58221620  0.262068011 -0.10733194  0.091061763
## Murder     -0.48797102 -0.538883437  0.22839021 -0.185035233
## HS.Grad     1.00000000  0.366779702  0.33354187 -0.088367214
## Frost       0.36677970  1.000000000  0.05922910  0.002276734
## Area        0.33354187  0.059229102  1.00000000 -0.341388515
## Density    -0.08836721  0.002276734 -0.34138851  1.000000000

Creating a saturated model (a model with all variables included).

model.saturated = lm(Life.Exp ~ ., data = states)

look at the model summary

summary(model.saturated)
## 
## Call:
## lm(formula = Life.Exp ~ ., data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47514 -0.45887 -0.06352  0.59362  1.21823 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.995e+01  1.843e+00  37.956  < 2e-16 ***
## Population   6.480e-05  3.001e-05   2.159   0.0367 *  
## Income       2.701e-04  3.087e-04   0.875   0.3867    
## Illiteracy   3.029e-01  4.024e-01   0.753   0.4559    
## Murder      -3.286e-01  4.941e-02  -6.652 5.12e-08 ***
## HS.Grad      4.291e-02  2.332e-02   1.840   0.0730 .  
## Frost       -4.580e-03  3.189e-03  -1.436   0.1585    
## Area        -1.558e-06  1.914e-06  -0.814   0.4205    
## Density     -1.105e-03  7.312e-04  -1.511   0.1385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7337 on 41 degrees of freedom
## Multiple R-squared:  0.7501, Adjusted R-squared:  0.7013 
## F-statistic: 15.38 on 8 and 41 DF,  p-value: 3.787e-10

Low p-values signify high significance; in our model, Murder and Population both fall under the 5% threshold.
\(R^2\) is about 75%.

Assessing the assumptions of the model.

plot(model.saturated)

  • The residual plot should center on zero and be evenly distributed around it; we don’t see grouping, and the variance seems evenly distributed
  • The QQ plot shows how normal the residuals are; ours seem to fit
  • Hawaii and Alaska are very close to the Cook’s distance threshold; Alaska could be an outlier

Let's check the VIF of the variables

vif(model.saturated)
## Population     Income Illiteracy     Murder    HS.Grad      Frost       Area 
##   1.634268   3.275340   5.475516   3.028169   3.229221   2.501893   2.429078 
##    Density 
##   2.377250

The VIF of Illiteracy is very high (above the threshold of 5), meaning the information in that variable is largely contained in the other predictors

remove illiteracy from model

model2 = lm(Life.Exp ~ . - Illiteracy, data = states)
summary(model2)
## 
## Call:
## lm(formula = Life.Exp ~ . - Illiteracy, data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47618 -0.38592 -0.05728  0.58817  1.42334 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.098e+01  1.228e+00  57.825  < 2e-16 ***
## Population   5.675e-05  2.789e-05   2.034   0.0483 *  
## Income       1.901e-04  2.883e-04   0.659   0.5133    
## Murder      -3.122e-01  4.409e-02  -7.081 1.11e-08 ***
## HS.Grad      3.652e-02  2.161e-02   1.690   0.0984 .  
## Frost       -6.059e-03  2.499e-03  -2.425   0.0197 *  
## Area        -8.638e-07  1.669e-06  -0.518   0.6075    
## Density     -8.612e-04  6.523e-04  -1.320   0.1939    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7299 on 42 degrees of freedom
## Multiple R-squared:  0.7466, Adjusted R-squared:  0.7044 
## F-statistic: 17.68 on 7 and 42 DF,  p-value: 1.12e-10

Certain variables are now more significant; Frost, in particular

check VIF again

vif(model2)
## Population     Income     Murder    HS.Grad      Frost       Area    Density 
##   1.426596   2.887415   2.437296   2.801423   1.551937   1.865654   1.911672

all VIFs decreased

Do a partial F-test to compare model2 to model.saturated. The null hypothesis: they are not different (\(\beta_{Illiteracy} = 0\))

anova(model2, model.saturated)

The p-value is high, so we do not reject the null hypothesis,
meaning we can use either model

The MASS library contains the step() function for step-wise feature selection, which gradually adds or removes predictors from the model automatically

# create the null model
model.empty = lm(Life.Exp ~ 1, data = states)
# create a model with all variables
model.full = lm(Life.Exp ~ ., data = states)
# a list of empty model to full model
scope = list(lower = formula(model.empty), upper = formula(model.full))
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# step(starting model, scope, direction; k = 2 gives AIC, k = log(50) gives BIC)
forwardAIC = step(model.empty, scope, direction = 'forward', k=2)
## Start:  AIC=30.44
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -14.609
## + Illiteracy  1    30.578 57.721  11.179
## + HS.Grad     1    29.931 58.368  11.737
## + Income      1    10.223 78.076  26.283
## + Frost       1     6.064 82.235  28.878
## <none>                    88.299  30.435
## + Area        1     1.017 87.282  31.856
## + Density     1     0.732 87.567  32.019
## + Population  1     0.409 87.890  32.203
## 
## Step:  AIC=-14.61
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS     AIC
## + HS.Grad     1    4.6910 29.770 -19.925
## + Population  1    4.0161 30.445 -18.805
## + Frost       1    3.1346 31.327 -17.378
## + Income      1    2.4047 32.057 -16.226
## <none>                    34.461 -14.609
## + Area        1    0.4697 33.992 -13.295
## + Illiteracy  1    0.2732 34.188 -13.007
## + Density     1    0.2609 34.200 -12.989
## 
## Step:  AIC=-19.93
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -25.920
## + Population  1    3.3405 26.430 -23.877
## <none>                    29.770 -19.925
## + Illiteracy  1    0.4419 29.328 -18.673
## + Area        1    0.2775 29.493 -18.394
## + Income      1    0.1022 29.668 -18.097
## + Density     1    0.0037 29.767 -17.932
## 
## Step:  AIC=-25.92
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1   2.06358 23.308 -28.161
## <none>                    25.372 -25.920
## + Income      1   0.18232 25.189 -24.280
## + Illiteracy  1   0.17184 25.200 -24.259
## + Density     1   0.06419 25.307 -24.046
## + Area        1   0.02573 25.346 -23.970
## 
## Step:  AIC=-28.16
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## + Density     1   0.66005 22.648 -27.598
## + Income      1   0.00606 23.302 -26.174
## + Illiteracy  1   0.00392 23.304 -26.170
## + Area        1   0.00079 23.307 -26.163
# backwardAIC = step(model.full, scope, direction = "backward", k = 2)
# bothAIC.empty = step(model.empty, scope, direction = "both", k = 2)
# bothAIC.full = step(model.full, scope, direction = "both", k = 2)

for BIC, set k = log(n) (here n = 50)

forwardBIC = step(model.empty, scope, direction = "forward", k = log(50))
## Start:  AIC=32.35
## Life.Exp ~ 1
## 
##              Df Sum of Sq    RSS     AIC
## + Murder      1    53.838 34.461 -10.785
## + Illiteracy  1    30.578 57.721  15.004
## + HS.Grad     1    29.931 58.368  15.561
## + Income      1    10.223 78.076  30.107
## <none>                    88.299  32.347
## + Frost       1     6.064 82.235  32.702
## + Area        1     1.017 87.282  35.680
## + Density     1     0.732 87.567  35.843
## + Population  1     0.409 87.890  36.027
## 
## Step:  AIC=-10.79
## Life.Exp ~ Murder
## 
##              Df Sum of Sq    RSS      AIC
## + HS.Grad     1    4.6910 29.770 -14.1894
## + Population  1    4.0161 30.445 -13.0687
## + Frost       1    3.1346 31.327 -11.6415
## <none>                    34.461 -10.7852
## + Income      1    2.4047 32.057 -10.4900
## + Area        1    0.4697 33.992  -7.5593
## + Illiteracy  1    0.2732 34.188  -7.2712
## + Density     1    0.2609 34.200  -7.2532
## 
## Step:  AIC=-14.19
## Life.Exp ~ Murder + HS.Grad
## 
##              Df Sum of Sq    RSS     AIC
## + Frost       1    4.3987 25.372 -18.271
## + Population  1    3.3405 26.430 -16.228
## <none>                    29.770 -14.189
## + Illiteracy  1    0.4419 29.328 -11.025
## + Area        1    0.2775 29.493 -10.746
## + Income      1    0.1022 29.668 -10.449
## + Density     1    0.0037 29.767 -10.284
## 
## Step:  AIC=-18.27
## Life.Exp ~ Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## + Population  1   2.06358 23.308 -18.601
## <none>                    25.372 -18.271
## + Income      1   0.18232 25.189 -14.720
## + Illiteracy  1   0.17184 25.200 -14.699
## + Density     1   0.06419 25.307 -14.486
## + Area        1   0.02573 25.346 -14.410
## 
## Step:  AIC=-18.6
## Life.Exp ~ Murder + HS.Grad + Frost + Population
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -18.601
## + Density     1   0.66005 22.648 -16.125
## + Income      1   0.00606 23.302 -14.702
## + Illiteracy  1   0.00392 23.304 -14.697
## + Area        1   0.00079 23.307 -14.691
# backwardBIC = step(model.full, scope, direction = "backward", k = log(50))
# bothBIC.empty = step(model.empty, scope, direction = "both", k = log(50))
# bothBIC.full = step(model.full, scope, direction = "both", k = log(50))

All models yield the same result: the Murder, HS.Grad, Frost, and Population variables

let's look at the model

summary(forwardAIC)
## 
## Call:
## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost + Population, 
##     data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

look at graphical representation

plot(forwardAIC)

Now we can make predictions

# to show fitted values
forwardAIC$fitted.values
##        Alabama         Alaska        Arizona       Arkansas     California 
##       68.48112       69.85740       71.41416       69.57374       71.79565 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##       71.10354       72.03459       71.12647       70.61539       68.63694 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##       72.09317       71.49989       70.19244       70.90159       72.39653 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##       71.90352       69.24418       69.15045       71.86095       70.51852 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##       72.44105       69.86893       72.26560       69.00535       70.10610 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##       71.40025       72.17032       69.52482       71.72636       71.59612 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##       70.03119       70.62937       69.28624       71.87649       71.08549 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##       71.15860       72.41445       71.38046       71.76007       69.06109 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##       72.01161       69.46583       69.97886       72.05753       71.06135 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##       70.14691       72.68272       70.44983       72.00996       70.87679
# add new data
newdata = data.frame(Murder = c(1.5, 7.5, 12.5),
                     HS.Grad = c(60, 50, 40),
                     Frost = c(75, 55, 175),
                     Population = c(7500, 554, 1212))
predict(forwardAIC, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 73.30214 72.73315 73.87114
## 2 70.80602 70.41971 71.19233
## 3 68.15925 67.45284 68.86566
predict(forwardAIC, newdata, interval = "prediction")
##        fit      lwr      upr
## 1 73.30214 71.74493 74.85935
## 2 70.80602 69.30589 72.30615
## 3 68.15925 66.54675 69.77176

Extending Model Flexibility

tests = read.table('2.a_Test_Scores.txt')
tests$Gender = as.factor(tests$Gender)

perform numerical EDA

sd(tests$Test.Score)
## [1] 8.715371
sd(tests$Hours.Studied)
## [1] 0.236849
cor(tests$Hours.Studied, tests$Test.Score)
## [1] 0.7883204

perform graphical EDA

plot(tests$Hours.Studied, tests$Test.Score)

fit a linear model

model.simple = lm(Test.Score ~ Hours.Studied, data = tests)
summary(model.simple)
## 
## Call:
## lm(formula = Test.Score ~ Hours.Studied, data = tests)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3385  -4.0140   0.3501   3.8466  16.7287 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.880      2.035   5.837  9.6e-09 ***
## Hours.Studied   29.008      1.015  28.593  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.368 on 498 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.6207 
## F-statistic: 817.5 on 1 and 498 DF,  p-value: < 2.2e-16
influencePlot(model.simple)

add confidence and prediction bands

newdata = data.frame(Hours.Studied = seq(1, 3, length = 100))
conf.band = predict(model.simple, newdata, interval = "confidence")
pred.band = predict(model.simple, newdata, interval = "prediction")

plot(tests$Hours.Studied, tests$Test.Score,
     xlab = "Hours Studied", ylab = "Test Score",
     main = "Simple Linear Regression Model\nTests Dataset")
abline(model.simple, lty = 2)
lines(newdata$Hours.Studied, conf.band[, 2], col = "blue") #Plotting the lower confidence band.
lines(newdata$Hours.Studied, conf.band[, 3], col = "blue") #Plotting the upper confidence band.
lines(newdata$Hours.Studied, pred.band[, 2], col = "red") #Plotting the lower prediction band.
lines(newdata$Hours.Studied, pred.band[, 3], col = "red") #Plotting the upper prediction band.
legend("topleft", c("Regression Line", "Conf. Band", "Pred. Band"),
       lty = c(2, 1, 1), col = c("black", "blue", "red"))

adding a quadratic term

model.quadratic = lm(Test.Score ~ Hours.Studied + I(Hours.Studied^2), data = tests)

summary(model.quadratic) 
## 
## Call:
## lm(formula = Test.Score ~ Hours.Studied + I(Hours.Studied^2), 
##     data = tests)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9941  -3.9011   0.1961   3.7502  16.8878 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          37.903     12.977   2.921  0.00365 **
## Hours.Studied         2.704     12.995   0.208  0.83523   
## I(Hours.Studied^2)    6.554      3.228   2.030  0.04286 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.351 on 497 degrees of freedom
## Multiple R-squared:  0.6246, Adjusted R-squared:  0.6231 
## F-statistic: 413.4 on 2 and 497 DF,  p-value: < 2.2e-16
plot(model.quadratic)

influencePlot(model.quadratic)

predict and create confidence bands

conf.band = predict(model.quadratic, newdata, interval = "confidence")
pred.band = predict(model.quadratic, newdata, interval = "prediction")

plot(tests$Hours.Studied, tests$Test.Score,
     xlab = "Hours Studied", ylab = "Test Score",
     main = "Quadratic Regression Model\nTests Dataset")
lines(tests$Hours.Studied[order(tests$Hours.Studied)],
      model.quadratic$fitted.values[order(tests$Hours.Studied)], lty = 2)
lines(newdata$Hours.Studied, conf.band[, 2], col = "blue") #Plotting the lower confidence band.
lines(newdata$Hours.Studied, conf.band[, 3], col = "blue") #Plotting the upper confidence band.
lines(newdata$Hours.Studied, pred.band[, 2], col = "red") #Plotting the lower prediction band.
lines(newdata$Hours.Studied, pred.band[, 3], col = "red") #Plotting the upper prediction band.
legend("topleft", c("Regression Line", "Conf. Band", "Pred. Band"),
       lty = c(2, 1, 1), col = c("black", "blue", "red"))

add a factor

model.factor = lm(Test.Score ~ Hours.Studied + Gender, data = tests)

Gender has been automatically dummified into ‘GenderMale’. This model fits a different y-intercept for each gender, but the same slope.

summary(model.factor) #Investigating the model and assessing some diagnostics.
## 
## Call:
## lm(formula = Test.Score ~ Hours.Studied + Gender, data = tests)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8361  -3.3240  -0.0968   3.3081  14.2784 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    15.1032     1.8188   8.304 9.61e-16 ***
## Hours.Studied  28.6541     0.8969  31.947  < 2e-16 ***
## GenderMale     -5.0372     0.4245 -11.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.743 on 497 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.7038 
## F-statistic:   594 on 2 and 497 DF,  p-value: < 2.2e-16
plot(model.factor)

influencePlot(model.factor)

predict and plot

col.vec = c(rep("pink", 250), rep("blue", 250))

plot(tests$Hours.Studied, tests$Test.Score, col = col.vec,
     xlab = "Hours Studied", ylab = "Test Score",
     main = "Linear Regression Model w/ Factor\nTests Dataset")
abline(model.factor$coefficients[1], #Intercept for females.
       model.factor$coefficients[2], #Slope for females.
       lwd = 3, lty = 2, col = "pink")
abline(model.factor$coefficients[1] + model.factor$coefficients[3], #Intercept for males.
       model.factor$coefficients[2], #Slope for males.
       lwd = 3, lty = 2, col = "blue")
legend("topleft", c("Female Regression", "Male Regression"),
       lwd = 3, lty = 2, col = c("pink", "blue"))

To train a new slope for each gender

model.factor = lm(Test.Score ~ Gender*Hours.Studied + Gender, data = tests)
summary(model.factor) #Investigating the model and assessing some diagnostics.
## 
## Call:
## lm(formula = Test.Score ~ Gender * Hours.Studied + Gender, data = tests)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8757  -3.2540  -0.1977   3.3211  14.0933 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               13.1053     2.5276   5.185 3.15e-07 ***
## GenderMale                -0.9715     3.5982  -0.270    0.787    
## Hours.Studied             29.6530     1.2549  23.630  < 2e-16 ***
## GenderMale:Hours.Studied  -2.0410     1.7937  -1.138    0.256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 496 degrees of freedom
## Multiple R-squared:  0.7058, Adjusted R-squared:  0.704 
## F-statistic: 396.7 on 3 and 496 DF,  p-value: < 2.2e-16
plot(model.factor)

influencePlot(model.factor)

predict and plot

col.vec = c(rep("pink", 250), rep("blue", 250))

plot(tests$Hours.Studied, tests$Test.Score, col = col.vec,
     xlab = "Hours Studied", ylab = "Test Score",
     main = "Linear Regression Model w/ Factor\nTests Dataset")
abline(model.factor$coefficients[1], #Intercept for females.
       model.factor$coefficients[3], #Slope for females (Hours.Studied).
       lwd = 3, lty = 2, col = "pink")
abline(model.factor$coefficients[1] + model.factor$coefficients[2], #Intercept for males (add the GenderMale offset).
       model.factor$coefficients[3] + model.factor$coefficients[4], #Slope for males (add the interaction term).
       lwd = 3, lty = 2, col = "blue")
legend("topleft", c("Female Regression", "Male Regression"),
       lwd = 3, lty = 2, col = c("pink", "blue"))

Logistic Regression

The Goal is to predict a categorical variable

Sigmoid Curves

  • An S-shaped curve bounded between two values, 0 and 1
  • Interpret the curve value as the probability of being in class 1 (success) rather than 0

\(p = \frac{e^{X\beta}}{1+e^{X\beta}} = \frac{1}{1+ e^{-X\beta}}\)
where \(X\beta\) is the product of the input matrix and coefficient vector
we can calculate the odds as the probability of success over the probability of failure
\(Odds = \frac{p}{1-p}\)

\(X\beta = \ln{\frac{p}{1-p}}\)
the linear output of the model is the log odds

\(\beta_0\) will shift the graph horizontally
\(\beta_1\) alters the slope of the sigmoid; a large \(\beta_1\) creates a sharp rise in the curve
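
A minimal R sketch of these relationships (b0 and b1 are hypothetical coefficients, just for illustration):

sigmoid = function(xb) 1 / (1 + exp(-xb))  # map a linear predictor to a probability

b0 = -2; b1 = 1.5
x = seq(-4, 6, length = 100)
p = sigmoid(b0 + b1 * x)                   # probabilities between 0 and 1
odds = p / (1 - p)                         # odds of success
log_odds = log(odds)                       # recovers b0 + b1 * x, the linear predictor
plot(x, p, type = "l")                     # the S-shaped curve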

Likelihood

Instead of minimizing the sum of squared errors as in linear regression, we will find the maximum likelihood estimate

  • Our dataset consists of actual observed outcomes; for the data we intend to use for logistic regression, the outcomes are one of two categories:
  • Success, which we will label ‘1’
  • Failure, which we will label as ‘0’
  • Using the logistic function we just derived, we can find the probability of success by passing our observations through the function:

\(p(x) = \frac{e^{X\beta}}{1+e^{X\beta}} = \frac{1}{1+ e^{-X\beta}}\)

  • we can calculate the probability of failure by calculating \(1-p(x)\)

  • Since the observations are assumed to be independent of one another, we can calculate the probability of our observed data by multiplying these probabilities together:

\(l(\beta_0, \beta_1, ..., \beta_p) = \prod_{i:y_i=1}p(x_i)\prod_{j:y_j=0}(1-p(x_j))\)

  • The first term multiplies the probabilities of success for each observation that is a success

  • the second term multiplies the probabilities of failure for each observation that is a failure

  • This equation represents the joint probability of successes and failures within our dataset; notice that the resulting equation involves the unknown parameters.

The best model maximizes the likelihood. The parameter estimates \((\beta_0, \beta_1, ...etc)\) that yield the highest likelihood are called the MLEs (maximum likelihood estimates)

  • to make the math easier we frequently take the log of the likelihood; remember:
  • \(\log{(x * y)} = \log{x} + \log{y}\)
  • frequently we only have a minimize function to work with in our code, so instead of maximizing the likelihood we minimize \(-\log(likelihood)\); see the sketch below
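
A minimal sketch of that minimization in R (toy data and the general-purpose optim() optimizer; in practice glm() does this fitting for us):

# negative log-likelihood for a one-predictor logistic model
neg_log_lik = function(beta, x, y) {
  p = 1 / (1 + exp(-(beta[1] + beta[2] * x)))   # fitted probabilities
  -sum(y * log(p) + (1 - y) * log(1 - p))       # minus the log of the likelihood product
}

x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)             # toy predictor
y = c(0, 0, 1, 0, 1, 1)                         # toy outcomes (not separable, so the MLE is finite)
optim(c(0, 0), neg_log_lik, x = x, y = y)$par   # approximate MLEs of (beta0, beta1)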

After successfully running MLE, the regression model is given by:
\(\ln{\frac{\hat{p}(X)}{1-\hat{p}(X)}} = \hat{\beta_0}+ \hat{\beta_1}X_1+ ...+ \hat{\beta_p}X_p = X\hat{\beta}\)

exponentiating both sides gives us:
\(\frac{\hat{p}(X)}{1-\hat{p}(X)} = e^{\hat{\beta_0}+ \hat{\beta_1}X_1 + ... + \hat{\beta_p}X_p} = e^{X\hat{\beta}}\)

  • The left hand side of the equation is the fitted odds of success

  • The right hand side of the equation is transformed from an additive relationship on the log-odds scale to a multiplicative relationship on the odds scale.

  • for a one unit increase in \(X_i\), the log odds of success change additively by \(\hat{\beta_i}\), i.e. the odds of success are multiplied by \(e^{\hat{\beta_i}}\), holding all other variables constant

Prediction

We need to find a cut off value for our model so that:

\[\hat{y}_i = \begin{cases} Success (1) & \hat{p_i} \ge c\\ Failure (0) & \hat{p_i} < c \end{cases}\]

  • The threshold ‘c’ is decided based on the model itself; a common value for c is 0.5
  • Take into account:
  • The randomness of the data
  • How likely we are to see a success/failure in the population of interest
  • The costs of incorrectly predicting a success/failure
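
A one-line sketch of applying the cutoff in R (p.hat stands in for a hypothetical vector of fitted probabilities):

p.hat = c(0.12, 0.48, 0.51, 0.97)  # hypothetical fitted probabilities
ifelse(p.hat >= 0.5, 1, 0)         # classify with cutoff c = 0.5
## [1] 0 0 1 1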

Hypothesis Testing

Wald Test
Wald tests are analogous to t-tests in linear regression; they rely on the approximate normality of the MLEs to test whether each coefficient is zero.

  • For logistic regression we assume that each coefficient estimate has a sampling distribution that is approximately normal. We can test a coefficient by calculating the z statistic:

\(z^* = \frac{\hat{\beta_i}-\beta_i} {SE(\hat{\beta_i})} \approx N(0,1)\)
  • The standard error is the estimated standard deviation of the sampling distribution of the coefficient estimate, and will be calculated for us by software.
  • Null hypothesis: \(H_0: \beta_i = 0\)
  • Alternative hypothesis: \(H_a:\beta_i \ne 0\)
If the null hypothesis is true (we fail to reject it when the p-value is large):
  • The log odds of success are unaffected by the value of \(X_i\)
  • In other words, knowing the value of \(X_i\) has no bearing on the prediction of our success
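
summary() reports these z statistics for a fitted glm; a sketch of computing them by hand, assuming m is any fitted logistic model (e.g. the logit.overall model fit in the example below):

# m: a fitted glm(..., family = "binomial") object
est = summary(m)$coefficients[, "Estimate"]
se = summary(m)$coefficients[, "Std. Error"]
z = est / se                           # Wald z statistic for H0: beta_i = 0
2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-values, matching summary(m)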

Confidence Interval

approximate 95% confidence interval for the log odds:

\(\hat{\beta_i} \pm 2 * SE(\hat{\beta_i})\)

95% confidence interval for odds is:

\(e^{\hat{\beta_i}\pm 2 * SE(\hat{\beta_i})}\)

Deviance And Goodness of Fit

we replace the concept of residual sum of squares with the concept of deviance \(G^2\)

  • The deviance, in part, is analogous to the idea of the coefficient of determination used in linear regression, and allows us to compare across various models.
    The deviance is given by:

\(G^2 = -2\ln(l_M) \approx \chi_{n-p}^2\)

  • The deviance associated with a given logistic regression model M is based on comparing the maximum log-likelihood of model M against the saturated model S.

  • S represents a model that has a parameter fit per observation in our dataset; i.e., this model severely overfits the data at hand.

  • The null hypothesis for the deviance test is:

\(H_0: error = 0\) (the model is a perfect fit)

This overall goodness-of-fit test is similar to the overall F-test we saw in MLR, but the hypotheses are switched

Comparing Models

Say we have a full logistic regression model with p predictors and a reduced model with a subset of those p predictors; to compare them, we use the drop in deviance (the deviances here are sometimes called residual deviances):

\((G_{reduced}^2 - G_{Full}^2) \approx \chi_{df_{reduced}-df_{full}}^2\)

The test statistic calculated from the difference between the deviance of two nested logistic regression models performs the following hypothesis test:

\(H_0\): The reduced model is sufficient (the dropped terms are not needed)

Null Deviance
  • calculated by comparing the deviance of the saturated model against the model with no predictors (just a \(\beta_0\) term)

The significance of a single term can be tested using the drop-in-deviance test; however, this is not the same as the Wald test, as you might expect from what we saw in multiple linear regression.

McFadden’s Pseudo R Squared

\(R_{deviance}^2 = 1- \frac{G_M^2}{G_{Null}^2}\)
\(R_{dev}^2\) is bounded between 0 and 1; a higher value means a stronger fit

Model Fitting and Evaluation

Analysis proceeds in a similar fashion as in multiple linear regression analysis; however, since it is difficult to assess model fit from scatter plots or residual analysis, model comparisons comprise most of the exploration.
  • Measures of AIC and BIC can be used to compare across models
  • Can still use stepwise procedures to find a suitable model

  • Confusion Matrix
  • Accuracy (the most common method)
    \(\frac{TP + TN}{TP + FP + TN + FN}\)
  • Sensitivity (or recall)
    \(\frac{TP}{TP + FN}\)
  • Specificity
    \(\frac{TN}{TN + FP}\)
  • Precision
    \(\frac{TP}{TP + FP}\)
  • F-Score
    \(\frac{2 * precision * recall}{precision + recall}\)
  • Kappa: adjusted accuracy, accounting for the possibility of correct predictions by chance (these metrics are computed in the sketch below, after the ROC note)
    \(\kappa = \frac{pr.actual - pr.expected}{1-pr.expected}\)

  • ROC and AUC

ROC: Receiver Operating Characteristic

  • an ROC graph is a curve of 1 - specificity on the x-axis against sensitivity on the y-axis for different threshold values
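
A minimal base-R sketch of the confusion-matrix metrics (truth and prediction stand in for hypothetical 0/1 vectors):

cm = table(truth = c(0, 0, 1, 1, 0, 1), prediction = c(0, 1, 1, 1, 0, 0))  # toy data
TN = cm["0", "0"]; FP = cm["0", "1"]; FN = cm["1", "0"]; TP = cm["1", "1"]

accuracy = (TP + TN) / sum(cm)
sensitivity = TP / (TP + FN)  # recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
f_score = 2 * precision * sensitivity / (precision + sensitivity)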

Example

Variables in the dataset: - Admit: A binary variable indicating whether or not a student was admitted to the graduate school.
- GRE: A continuous variable indicating a student’s score on the GRE.
- GPA: A continuous variable indicating a student’s undergraduate grade point average.
- Rank: A categorical variable indicating the level of prestige of a school; 1 indicates the highest prestige, 4 indicates the lowest prestige.

GradSchools = read.table("Graduate_Schools.txt")

head(GradSchools)
summary(GradSchools) #Looking at the five number summary information.
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(GradSchools, sd) #Looking at the individual standard deviations.
##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602
sapply(GradSchools, class) #Looking at the variable classes.
##     admit       gre       gpa      rank 
## "integer" "integer" "numeric" "integer"

check to see that we have data in all combinations of the categorical variables

table(GradSchools$admit)/nrow(GradSchools) #Manually calculating the proportions.
## 
##      0      1 
## 0.6825 0.3175
table(GradSchools$admit, GradSchools$rank) #Checking to see that we have data
##    
##      1  2  3  4
##   0 28 97 93 55
##   1 33 54 28 12
                                           #available in all combinations of
                                           #the categorical variables.

Graphical EDA

plot(GradSchools, col = GradSchools$admit + 2)

convert Grad school rank to factor

GradSchools$rank = as.factor(GradSchools$rank)

Create a logistic Model

Fitting the logistic regression with all variables; the family parameter
specifies the error distribution and link function to be used. For logistic
regression, this is binomial.

logit.overall = glm(admit ~ gre + gpa + rank,
                    family = "binomial",
                    data = GradSchools)

Plot Residuals

Residual plot for logistic regression with an added loess smoother; we would
hope that, on average, the residual values are 0.

scatter.smooth(logit.overall$fit,
               residuals(logit.overall, type = "deviance"),
               lpars = list(col = "red"),
               xlab = "Fitted Probabilities",
               ylab = "Deviance Residual Values",
               main = "Residual Plot for\nLogistic Regression of Admission Data")
abline(h = 0, lty = 2)

Influence plot of model

library(car)
influencePlot(logit.overall)

Investigate Overall Fit of Model on the log odds scale

summary(logit.overall)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = GradSchools)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

Coefficient interpretations on the log odds scale:
- Intercept: The log odds of a student getting admitted to a graduate school
when they attended a top tier undergraduate school and received a
0 on the GRE and a 0 as their GPA is approximately -3.990.
- GRE: For every additional point a student scores on the GRE, their log odds
of being admitted to graduate school increase by approximately 0.002,
holding all other variables constant.
- GPA: For every additional point a student raises their GPA, their log odds of
being admitted to graduate school increase by approximately 0.804, holding
all other variables constant.
- Rank: The change in log odds associated with attending an undergraduate school
with prestige of rank 2, 3, and 4, as compared to a school with prestige
rank 1, is approximately -0.675, -1.340, and -1.552, respectively, holding
all other variables constant.

Investigate Overall Fit of Model on the odds scale

exp(logit.overall$coefficients)
## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375

Coefficient interpretations on the odds scale:
- Intercept: The odds of a student getting admitted to a graduate school
when they attended a top tier undergraduate school and received a
0 on the GRE and a 0 as their GPA is approximately 0.019.
- GRE: For every additional point a student scores on the GRE, their odds
of being admitted to graduate school multiply by approximately 1.002,
holding all other variables constant.
- GPA: For every additional point a student raises their GPA, their odds of
being admitted to graduate school multiply by approximately 2.235, holding
all other variables constant.
- Rank: The multiplicative change in odds associated with attending an undergraduate school
with prestige of rank 2, 3, and 4, as compared to a school with prestige
rank 1, is approximately 0.509, 0.262, and 0.212, respectively, holding
all other variables constant.

Investigate Relationship of Odds and log odds

For logistic regression objects, the confint() function defaults to using the log likelihood to generate confidence intervals; this is similar to inverting the likelihood ratio test.

cbind("Log Odds" = logit.overall$coefficients,
      "Odds" = exp(logit.overall$coefficients))
##                 Log Odds      Odds
## (Intercept) -3.989979073 0.0185001
## gre          0.002264426 1.0022670
## gpa          0.804037549 2.2345448
## rank2       -0.675442928 0.5089310
## rank3       -1.340203916 0.2617923
## rank4       -1.551463677 0.2119375
confint(logit.overall)
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605

Generate Confidence Intervals

confint.default(logit.overall)
##                     2.5 %       97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre          0.0001202298  0.004408622
## gpa          0.1536836760  1.454391423
## rank2       -1.2957512650 -0.055134591
## rank3       -2.0169920597 -0.663415773
## rank4       -2.3703986294 -0.732528724

Generate confidence intervals on the odds scale

exp(confint(logit.overall))
## Waiting for profiling to be done...
##                   2.5 %    97.5 %
## (Intercept) 0.001889165 0.1665354
## gre         1.000137602 1.0044457
## gpa         1.173858216 4.3238349
## rank2       0.272289674 0.9448343
## rank3       0.131641717 0.5115181
## rank4       0.090715546 0.4706961
exp(confint.default(logit.overall))
##                   2.5 %    97.5 %
## (Intercept) 0.001980825 0.1727834
## gre         1.000120237 1.0044184
## gpa         1.166121956 4.2818768
## rank2       0.273692172 0.9463578
## rank3       0.133055086 0.5150889
## rank4       0.093443470 0.4806919

perform a drop-in-deviance test to see whether a variable (in this case rank) matters; first fit the model without it

logit.norank = glm(admit ~ gre + gpa,
                   family = "binomial",
                   data = GradSchools)

Check the deviances against each other

reduced.deviance = logit.norank$deviance #Comparing the deviance of the reduced
reduced.df = logit.norank$df.residual    #model (the one without rank) to...

full.deviance = logit.overall$deviance #...the deviance of the full model (the
full.df = logit.overall$df.residual    #one with the rank terms).

pchisq(reduced.deviance - full.deviance,
       reduced.df - full.df,
       lower.tail = FALSE)
## [1] 7.088456e-05

Above, we can see the p-value is extremely small (< .0005); we have evidence to conclude that the model with the factors for rank is preferable to the model without them.
More simply, we can use the anova() function and set the test to “Chisq”.

anova(logit.norank, logit.overall, test = "Chisq")

Check predictions: how does the probability of admission change across ranks for a student
who has an average GRE and an average GPA?
By default, predict() on a glm gives the log odds:

newdata = with(GradSchools, data.frame(gre = mean(gre),
                                       gpa = mean(gpa),
                                       rank = factor(1:4)))
predict(logit.overall, newdata)
##           1           2           3           4 
##  0.06643085 -0.60901208 -1.27377307 -1.48503283

to get actual probabilities set type to ‘response’

predict(logit.overall, newdata, type = "response")
##         1         2         3         4 
## 0.5166016 0.3522846 0.2186120 0.1846684

make it easier to see the effect of the rank variable

cbind(newdata, "P_Admit" = predict(logit.overall, newdata, type = "response"))

convert the fitted probabilities to binary based on a cut off value of .5 with the round function

admitted.predicted = round(logit.overall$fitted.values)
table(truth = GradSchools$admit, prediction = admitted.predicted)
##      prediction
## truth   0   1
##     0 254  19
##     1  97  30

It seems like this model made a lot of mistakes (116 out of 400)! This is quite
dreadful in this case. Let’s do a little bit more exploring. We never looked at
the overall test of deviance:
check deviance

pchisq(logit.overall$deviance, logit.overall$df.residual, lower.tail = FALSE)
## [1] 0.01365347

The p-value for the overall test of deviance is <.05, indicating that this model is not a good overall fit!

Check McFadden’s pseudo \(R^2\)

1 - logit.overall$deviance/logit.overall$null.deviance
## [1] 0.08292194

Only about 8.29% of the variability in admission appears to be explained by
the predictors in our model; while the model is valid, it seems as though it
isn’t extremely informative.

What have we found out? The overall model we created doesn’t give us much
predictive power in determining whether a student will be admitted to
graduate school.

table(GradSchools$admit) #Our data contains 273 unadmitted students and 127
## 
##   0   1 
## 273 127
                         #admitted students.
table(admitted.predicted) #The model we created predicts that 351 students will
## admitted.predicted
##   0   1 
## 351  49
                          #not be admitted, and only 49 will be admitted.
table(truth = GradSchools$admit, prediction = admitted.predicted)
##      prediction
## truth   0   1
##     0 254  19
##     1  97  30

The table of the truth against the prediction shows that we only have an accuracy
of (254 + 30)/400 = 71%; yet, if we were to simply predict “unadmitted” for
everyone uniformly, we would have an accuracy of 273/400 = 68.25%! No wonder
the overall test of deviance flagged a poor fit: predicting just the intercept
(the baseline probability) was nearly as good.

Shiny

Basic Layout (ui.r)

fluidPage(
    titlePanel(...),
    sidebarLayout(
      sidebarPanel(...),
      mainPanel(...)
    )
)
library(shiny)
library(lubridate)

# Define UI for application that draws a histogram

fluidPage(
  titlePanel("NYC Flights 2014"),
  sidebarLayout(
    sidebarPanel(
     
      selectizeInput(inputId = "origin",
                    label = "Departure airport",
                     choices = unique(flights$origin)),
      selectizeInput(inputId = "dest",
                     label = "Arrival airport",
                     choices = unique(flights$dest)),
      selectizeInput(inputId = "month",
                     label = "Month",
                     choices = unique(flights$month)
    ),
    mainPanel(tabsetPanel(
      tabPanel('Number of Flights', plotOutput("count")),
      tabPanel('Average Delay', plotOutput("delay"))
    ))
  )
)

Basic Server (server.r)

function(input, output) {
}
library(shiny)
library(dplyr)
library(ggplot2)
library(tidyverse)

function(input, output) {
  output$count <- renderPlot(
    flights %>%
      filter(origin == input$origin & dest == input$dest & month == input$month) %>%
      group_by(carrier) %>%
      count() %>%
      ggplot(aes(x = carrier, y = n)) +
      geom_col(fill = "lightblue") +
      ggtitle("Number of flights"))
  output$delay <- renderPlot(
    flights %>%
      filter(origin == input$origin & dest == input$dest & month == input$month) %>%
      group_by(carrier) %>%
      summarise(mean_arr = mean(arr_delay), mean_del = mean(dep_delay)) %>%
      gather(key = metric, value = delay, mean_arr, mean_del)  %>%
      ggplot(aes(x = carrier, y = delay, fill = metric )) +
      geom_col(position = "dodge2") +
      ggtitle('Average Delays')
      
  )
  
  
}

Global (global.r)

#for above example
library(data.table)
flights <- fread(file = "./flights14.csv")

HTML

Workflow

Start a new document, index.html, with <!DOCTYPE html>

Parts of HTML Document

<head> </head> (with <title> </title> inside)
contains technical info about the meta structure of the doc
<body> </body>
content goes here (tip: in editors with Emmet, lorem# expands to that many lorem ipsum placeholder words)

Headers and Paragraphs and divs

<h1> </h1>
<p> </p>
<div> </div>

Insert a link or an image

<a href="about.html">About</a>
<img src="image.png" alt="description">

CSS

for external css create a style.css page

Inline CSS

<h1 style="color: olive; font-size: 3rem;">hello world</h1>

Internal CSS

<style>
h1 {
color: blue;
font-size: 3rem;
}
</style>
will change the appearance of all <h1> elements;
the same works with spans, body, divs, paragraphs, etc…

External CSS

Element selectors and grouping

h1,h2{
color: red;
}
IDs

#heading {
background: black;
}
<h1 id="heading">lorem ipsum</h1>

Class Selectors

.green {
color: green;
}
<h3 class="green">I’m green</h3>

Universal Selector

* {
background-color: ivory;
}
Universal selector will be the top level css style, unless overwritten by subsequent styles

RGBA 0-1 Opacity/Transparency

<selector> {
background: rgba(0, 0, 0, .25);
/* (r value, g value, b value, opacity) */
}

Font size, width, height, pixels

{
font-size: 30px;
width: 200px;
height: 300px;
}

Percent Value Relative

{
width: 50%;
height: 50%;
}

SQL

comments

-- Syntax for single line comments
/** multiline comment **/

Commands

SHOW DATABASES; or SHOW TABLES;

shows all databases on the server, or all tables in the current database

USE <database>

sets the default database, so every unqualified table reference resolves against it; for example, after
USE movies_db;
FROM movies is the same as FROM movies_db.movies;

SELECT <columns>

chooses the columns to return (SELECT * returns all columns)

FROM <table>

names the table the SELECT query reads from

WHERE …

indicates the condition or conditions that rows must satisfy to be selected.

SELECT movie_title,
budget
FROM movies
WHERE budget > 1000000000;

LIMIT …

SELECT column_names  
FROM table_names  
LIMIT <offset>, <count>;  
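
For example, against the movies table used throughout this section:

SELECT movie_title
FROM movies
LIMIT 5, 10;  -- skip 5 rows, return the next 10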

AS …

see column alias

CASE WHEN <condition> THEN <result> ELSE <result> END

SELECT movie_title,   
CASE   
       WHEN title_year > 2014 THEN 'new' 
       WHEN title_year BETWEEN 2012 AND 2014 THEN 'new-ish'   
ELSE 'old'  
END AS movie_age  
FROM movies;  

DESCRIBE <table>

shows column names, types, and other schema info

Column Alias

SELECT <expression> AS <alias>
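
For example:

SELECT movie_title AS title,
gross - budget AS profit
FROM movies;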

Mathematical Operators

+
-
*
/
Round()

Comparison Operator

= , equal to
> , Greater than
< , less than
>= , greater than or equal to
<= , less than or equal to
<> , != , not equal to

Logical Operators

AND

Evaluates two conditions and returns True if both are true.

OR

Evaluates two conditions and returns True if at least one is true.

BETWEEN…AND…

Tests value against an inclusive range of values.

SELECT movie_title,  
title_year BETWEEN 2010 AND 2012 AS year_range  
FROM movies;  

IN

Tests value against a list of values and returns True if value matches at least one in the list.

SELECT movie_title, 
title_year IN (2010, 2012, 2014) AS year_range,  
country IN ('USA', 'UK') AS country_range  
FROM movies;  

IS NULL

Tests value to see if it is null. Return True if null.

NOT

Reverts False to True, and True to False.

String Manipulation

Wildcard Comparisons

LIKE operator in combo with wildcards are used for pattern matching
% used to define any number of missing characters before and after a pattern
_ wildcard used to substitute for a single character or space
example:
Return True if the country name starts with a ‘U’ else return false

SELECT country LIKE 'U%'  
FROM MOVIES;  

Return True if country starts with ‘U’ and ends with ‘A’ and has exactly one character in between

SELECT country LIKE 'U_A'  
FROM MOVIES;  

Functions

LOWER(), LCASE() Return string in lowercase
UPPER(), UCASE() Return string in uppercase
CHAR_LENGTH() Return number of characters in a string
LEFT(, n) Return the leftmost number of characters specified by n
RIGHT(, n) Return the rightmost number of characters specified by n
SUBSTR(, m, n) Return the number of characters specified by n starting from the m-th position of the string
LTRIM() Return the string with leading spaces removed
RTRIM() Return the string with trailing spaces removed
TRIM() Return the string with leading and trailing spaces removed
REPLACE(s, s1, s2) Replace all the occurrences of substring s1 within string s with s2
COUNT

SELECT COUNT(*)  
FROM <table_name>;  

counts all rows

SELECT COUNT(column_name)  
FROM movies;  

counts the non-null values in that column

SUM

SELECT SUM(title_year IS NULL)  
FROM movies;  

counts the number of null values in the title_year column
AVG

SELECT AVG(title_year IS NULL)  
FROM movies;  

will give the ratio of null values
DISTINCT

SELECT COUNT(DISTINCT country)  
FROM movies;  

MAX
Return the maximum value
MIN
Return the minimum value
STD
Return the population standard deviation

VARIANCE
Return the population variance
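
For example:

SELECT MAX(imdb_score), MIN(imdb_score),
STD(imdb_score), VARIANCE(imdb_score)
FROM movies;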

HAVING
used to filter groups according to the result of an aggregation

SELECT country, COUNT(DISTINCT director_name)  
FROM movies  
GROUP BY country  
HAVING COUNT(DISTINCT director_name) > 5;  

UNION

SELECT movie_title, title_year, director_name  
FROM movies  
WHERE title_year = 1972  
UNION  
SELECT movie_title, title_year, director_name  
FROM movies  
WHERE title_year = 1942;  

UNION ALL
keeps duplicate records in a union
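
For example, stacking the two tables with duplicates kept:

SELECT name, birthday FROM actors
UNION ALL
SELECT name, birthday FROM directors;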

JOIN
by default, JOIN is an inner join

SELECT expression [, expression ...]  
FROM table1 JOIN table2  
ON condition-involving-table1-to-table2;  
SELECT *  
FROM directors JOIN actors  
ON directors.name = actors.name; 

LEFT OUTER JOIN
select all records from left table and only records on right table that match

SELECT * FROM movies LEFT OUTER JOIN directors  
ON movies.director_name = directors.name;  

FULL OUTER JOIN
MySQL has no FULL OUTER JOIN command; emulate it by UNIONing a left and a right outer join

SELECT * FROM movies LEFT OUTER JOIN directors  
ON movies.director_name = directors.name  
UNION  
SELECT * FROM movies RIGHT OUTER JOIN directors  
ON movies.director_name = directors.name;  

Exclusion Join
Let’s say we want to look at directors who are NOT actors. These will be directors who do not fall in the actors table.

SELECT * FROM directors LEFT OUTER JOIN actors  
ON directors.name = actors.name  
WHERE actors.name IS NULL;  

GROUP BY

SELECT country, AVG(imdb_score)  
FROM movies  
GROUP BY country;  

show the average imdb score for each country

SELECT country, COUNT(DISTINCT director_name)  
FROM movies  
GROUP BY country;  

shows number of directors in each country

WHERE key word is used before group by
HAVING is used after group by

Multiple GROUP BYs

SELECT country, title_year, COUNT(*)  
FROM movies  
GROUP BY country, title_year;  
gives the number of movies in each country per year 

ORDER BY

SELECT country  
FROM movies  
ORDER BY country;  

or

SELECT country, language  
FROM movies  
ORDER BY country, language DESC;  

Order of Evaluation

1 FROM
2 WHERE
3 GROUP BY
4 HAVING
5 SELECT

Subqueries

  • A subquery is a SQL statement that has another SQL query embedded in the WHERE or the HAVING clause.
  • The syntax for a subquery when the embedded SQL statement is part of the WHERE condition is as follows:
SELECT expressions  
FROM tables  
WHERE conditions [Comparison Operators]  
(SELECT expressions  
FROM tables  
WHERE conditions);  
  • [Comparison Operators] could be equality operators such as =, >, <, >=, and <=. It can also be a text operator such as “LIKE”.
  • The second select clause is considered the “inner query,” while the first select clause is considered the “outer query.”

Example: we want to find the highest average rating among the grouped genres

SELECT MAX(average.avg_rating)
FROM (
    SELECT AVG(imdb_score) AS avg_rating
    FROM movies
    GROUP BY genres
) AS average;

ALL

Compares a value to every value from the result table
example: the following query returns all of the action movies that have a higher IMDB score than the best fantasy movie:

SELECT *
FROM movies
WHERE genres = 'Action'
AND imdb_score > ALL
    (SELECT imdb_score
    FROM movies
    WHERE genres = 'Fantasy');

EXISTS

The EXISTS operator checks if the row from the subquery matches any row in the outer query. If there’s no data matched, the EXISTS operator returns FALSE.
The following query returns all the movies that are directed by directors who were born after 01/01/1980.

SELECT movie_title
FROM movies
WHERE EXISTS (SELECT *
    FROM directors
    WHERE directors.name = movies.director_name
    AND birthday > '1980-01-01');

Window Functions

RANK() OVER

SELECT
    gross, RANK() OVER (ORDER BY gross)
    FROM movies
    WHERE gross IS NOT NULL;

Cutting with NTILE()

A very common analytical method is to cut numeric values into a few categories. NTILE can help us do so:

SELECT
gross,
NTILE(5) OVER (ORDER BY gross DESC) AS tier
  FROM movies
  WHERE gross IS NOT NULL;

breaks the rows into 5 tiers

Cumulative Sums

SELECT gross,
RANK() OVER (ORDER BY gross) AS _rank,
SUM(gross) OVER (ORDER BY gross) AS cum_sum,
AVG(gross) OVER (ORDER BY gross) AS cum_avg
FROM movies
WHERE gross IS NOT NULL;

Lag and rolling values

SELECT
    gross,
  AVG(gross) OVER (ORDER BY gross) AS cum_avg,
  AVG(gross) OVER (
    ORDER BY gross
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
      ) AS moving_avg
FROM movies
WHERE gross IS NOT NULL;

the moving average takes the average of the current row and the 2 previous rows

SELECT
    gross,
  LAG(gross) OVER(ORDER BY gross) AS delay,
  gross - LAG(gross) OVER(ORDER BY gross) AS diff,
  (gross - LAG(gross) OVER(ORDER BY gross))/gross AS rate
FROM movies
WHERE gross IS NOT NULL;

The LAG function returns the previous row’s value; LAG(<column>, <# to lag>) defaults to a lag of 1

PARTITION

Rank facebook likes by tier

SELECT tier, movie_title,
    RANK() OVER
    (PARTITION BY tier ORDER BY movie_facebook_likes DESC)
FROM (
    SELECT
      movie_title, num_critic_for_reviews,
      movie_facebook_likes, budget,
      gross, NTILE(5) OVER (ORDER BY gross DESC) AS tier
FROM movies
WHERE gross IS NOT NULL
) Q;

Data Manipulation

CREATE TABLE

CREATE TABLE table_name (
column_name column_type [options],
...
)

example

CREATE TABLE actors (
biography VARCHAR(6000),
birthday DATE DEFAULT NULL,
deathday DATE DEFAULT NULL,
gender INT DEFAULT NULL,
name VARCHAR(255),
place_of_birth VARCHAR(255),
facebook_likes INT DEFAULT NULL
);

for more options/flags
https://dev.mysql.com/doc/refman/5.7/en/create-table.html

INSERT

INSERT INTO table_name
(column_name_1, column_name_2, ...)
VALUES
(value_row1_col1, value_row1_col2, ...),
(value_row2_col1, value_row2_col2, ...),
...
;
INSERT INTO actors
(birthday,deathday,gender,name,place_of_birth,facebook_likes)
VALUES
("1942-08-17",NULL,0,"Roshan Seth","Patna, Bihar, India",61.0);

UPDATE

UPDATE table_name SET column_name = new_value
[WHERE where_condition];

first disable safe updates mode with SET SQL_SAFE_UPDATES = 0; so MySQL allows updates without a key-based WHERE

UPDATE actors SET facebook_likes = 68
WHERE name = "Roshan Seth";

DELETE

DELETE FROM tbl_name
[WHERE where_condition];

TRUNCATE

to empty a table

TRUNCATE TABLE table_name;

DROP

to fully remove a table

DROP TABLE table_name;

Create Temporary Table

select
  select_list
into
  #temp_table_name -- temporary table name must start with a hashtag (SQL Server syntax)
from
  table_name
where
  criteria

SQL Command Line Prompt

in command line
mysql -u<username(no space)> -p<password>

LOAD data

from my SQL command line

LOAD DATA LOCAL INFILE file_name
INTO TABLE table_name
FIELDS TERMINATED BY string_1 ENCLOSED BY string_2
LINES TERMINATED BY string_3
IGNORE int LINES
;

Data Integrity

Primary Key

CREATE TABLE actors (
biography varchar(6000),
birthday date DEFAULT NULL,
deathday date DEFAULT NULL,
gender int DEFAULT NULL,
name varchar(255),
place_of_birth varchar(255),
facebook_likes int DEFAULT NULL,
PRIMARY KEY (name)
);

Foreign key

ALTER TABLE child_table
ADD CONSTRAINT constraint_name
FOREIGN KEY index_col_name
REFERENCES parent_table(index_col_name);
ALTER TABLE movies
ADD CONSTRAINT FK_ActorMovie
FOREIGN KEY (actor_1_name) REFERENCES actors(name);

CASCADE or SET NULL

When you delete or change a record that is referenced as a primary or foreign key in another table, the constraint’s action decides what happens: ON DELETE CASCADE will change/delete all child records; ON DELETE SET NULL will set null in all child records (see the sketch below)
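
A sketch of attaching the behavior to a foreign key constraint (reusing the FK_ActorMovie names from above; MySQL syntax):

ALTER TABLE movies
ADD CONSTRAINT FK_ActorMovie
FOREIGN KEY (actor_1_name) REFERENCES actors(name)
ON DELETE CASCADE;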

Example Exercises

  • Find the average loss of the movies which actually lost money, for each country.
SELECT country, AVG(gross - budget)
FROM movies
WHERE (gross - budget) < 0
GROUP BY country;  
  • Find the average loss of the movies for each country whose movies lost money on average.
SELECT country, AVG(gross - budget)  
FROM movies  
GROUP BY country  
HAVING AVG(gross - budget) < 0;  
  • Show the languages and the count of the movies in which each language is spoken. Order the result by the count.
select language,  
    count(*)  
from movies  
group by language  
order by count(*) desc;  
  • Find how many records are in each of the tables of the previous example from performing both UNION and UNION ALL on the actors and directors tables.
SELECT COUNT(*) FROM (  
SELECT name, birthday FROM actors  
UNION  
SELECT name, birthday FROM directors
) AS union_table;  
  • Exercise: How many people are both directors and actors in the dataset?
select count(*) as DirectorActors  
from actors  
join directors  
on directors.name = actors.name  
-- match birthdays too, so different people who share a name are filtered out  
and (directors.birthday = actors.birthday OR  
(directors.birthday is NULL AND  
actors.birthday is NULL));  
  • Find the number of directors who are not also in the actors table
select count(*) from directors left outer join actors  
on directors.name = actors.name  
where actors.name is null;  
  • find the number of actors who are not in the directors table
select count(*) from actors left outer join directors  
on actors.name = directors.name  
where directors.name is null;  
  • the number of people who are both directors and actors.
select count(*) from actors inner join directors  
on actors.name = directors.name;  
  • Find all the directors whose facebook_likes are in the top 10 facebook_likes from the directors table.
SELECT name, facebook_likes
FROM directors
WHERE facebook_likes >= (
    SELECT facebook_likes
    FROM directors
    ORDER BY facebook_likes DESC
    LIMIT 1 OFFSET 9 -- n-1 here for the top n.
)
ORDER BY facebook_likes DESC;
  • Find the average facebook likes and critics reviews for each of 5 tiers of profitability of the movies DB
SELECT tier,
    AVG(num_critic_for_reviews) AS avg_critics,
    AVG(movie_facebook_likes) AS avg_flikes
FROM (
    SELECT
        movie_title, num_critic_for_reviews,
        movie_facebook_likes, budget,
        gross, NTILE(5) OVER (ORDER BY gross DESC) AS tier
    FROM movies
    WHERE gross IS NOT NULL
) Q
GROUP BY Q.tier;

Concatenation

CONCAT(<col name>, <col name>, …)
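
For example:

SELECT CONCAT(movie_title, ' (', title_year, ')') AS labeled_title
FROM movies;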

Date Manipulation

Functions

CURRENT_DATE() Return current date.
CURRENT_TIME() Return current time.
DATE() Extract the date part of a date or datetime expression.
DAY(), DAYOFMONTH() Return the day of the month (1- 31)
DAYOFWEEK() Return the weekday index of a date or datetime expression.
MONTH() Return the month part of a date or datetime expression.
YEAR() Return the year part of a date or datetime expression.
DATE_ADD(date, INTERVAL value unit)
Add a time/date interval to a date and return the resulting date
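
For example:

SELECT DATE_ADD('2014-01-01', INTERVAL 10 DAY);
-- returns 2014-01-11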

Docker

Images

  • An executable package that includes everything needed to run an application: code, runtime, libraries, environment variable configuration files
  • can build your own

can find images at https://www.dockerhub.com

Container

  • a runtime instance of an image; it is what the image becomes in memory when executed
  • once you have the image pulled you can start the container from the docker engine
  • containers isolate software from its environment.

Commands

List all images

docker image ls

Delete a file

rm

Continue command on new line

^

Pull an image

docker pull <docker IMAGE>

Remove image from computer

docker image rm -f <IMAGE ID>

Run a container

docker run -it -d --rm ^
-p hostPort:containerPort ^
-v host-dir:container-dir ^
<docker IMAGE> ^
bash

actual example (%cd% is the current working directory on Windows):

docker run -it --rm ^
-p 8890:8888 ^
-v %cd%:/home/ubuntu/Workspace ^
nycdsa/linux-toolkits

Options

-it runs the container interactively with a terminal attached.
-d (optional) runs the container in the background in detached mode.
--rm (optional) removes the container when it exits or when the daemon exits.
-p hostPort:containerPort (optional) forwards a container’s port to the host.
-v host-dir:container-dir (optional) mounts a host directory into a container directory so files can be shared between them.
bash (optional) starts the command line of the container in the terminal. Depending how the image was built, you may or may not need it.

Inspect running containers

docker ps

Inspect stopped containers

docker ps -a

Stop a container

exit (from inside the container’s shell); from the host, docker stop <container> also works

Remove a stopped container

docker container rm <container>

only need to specify enough characters of the container to uniquely identify it

Restart and Log back into container

docker restart <container>

docker exec -it <container> bash

General Windows Terminal

Commands

Echo

echo prints its arguments

Change Directory

cd

List files and directories

dir

Linux

Commands

date (gets date)

ssh (log into a remote server)

scp (copy from or to a remote machine)

wget or curl (download a webpage)

Command help

command –help

List files and directories

ls

Change directories

cd

cd or cd ~ changes to home directory

cd .. changes to parent folder

Make DIrectory

mkdir

mkdir -p ~/examples/multiple/levels/down

(-p makes multiple nested directories regardless of whether parts of the specified path exist)

Copy file or Folder

cp [-r]

-r argument means recursively for copying folders

copy a file into home directory; note the .
$ cp /etc/magic .
copy a file into home directory with a new name
$ cp /etc/hosts nethosts
$ ls

Move or rename files

mv

Move cursor and keyboard shortcuts

ctrl + a (beginning of line)
ctrl + e (end of line)
ctrl + u (delete line)
Tab (auto-complete)

Remove files

rm (removes files), rmdir (removes empty folders), rm -r (removes folders and their contents recursively)

Quick create file

touch

I/O redirection

> create
>> append

Download file

wget

File permissions

ls - l (long format) to show permissions

The permissions are broken into 4 sections. For example: drwxr-xr-x

  1. d: ‘-’ -> file, ‘d’ -> directory, ‘l’ -> link

  2. rwx: permissions for the owner

  3. r-x: permissions for members of the group owning the file

  4. r-x: permissions for other users

Where ‘r’-> read, ‘w’-> write/delete, ‘x’-> execute, ‘-’-> no permission

Show command line format

echo $PS1 (PS1-4 exist)

change command prompt

export PS1="\u \t:$"

will change prompt to user name and time followed by ":$"

Clear screen

ctrl + L

Stop process

ctrl + C

Stop Program

ctrl + D

Relative Path

./ (is current directory)

Parent Directory

../

home directory

~

show where a file is

which

change permission

write = 2

execute = 1

read = 4

chmod 744

owner will be given permission 7 (read + write + execute = 4 + 2 + 1)

group will be given permission 4 (read only)

other users will be given permission 4 (read only)
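
For example (myscript.sh is a hypothetical file):

chmod 744 myscript.sh
ls -l myscript.sh

will now show -rwxr--r-- for the file: owner rwx, group r, others r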

Text Processing

grep - Search through a given file using a string

grep <string> <file1> [<file2>]

wc - Count lines and words

outputs lines/words/characters

sort - Sort all lines in file

sort -k9 (sort using column 9 as the key; columns count from 1)

sort -k5 -n (sorts 5th column which contains numbers)

VI

commands

Open vi

vi <filename>
will open the file, or create it if it doesn’t exist

insert mode

I - insert at the beginning of the line

a - insert after cursor

i - insert before cursor

o - insert a new line below

O - insert a new line above

ESC - exit insert mode

command mode

esc

x - delete one character
dw - delete one word
dd - delete the line
D - delete the rest of the line
u - undo the last action

exit VI

ZZ - save file and exit (capital letters)
:wq (save and quit)
:w save without exiting
:q! exit without saving

Git

workflow

pull repo from remote

-git init
or
-git clone
write code
-git add
code is now in the staging area
-git commit
commits the changes and opens an editor to add a message
write code
-git add
-git commit -m "message"
to revert a change: git checkout <commit hash or HEAD~n>
brings us to a detached head state
-git checkout master (or main)
reattaches the head
for checkouts that could cause conflicts
-git revert
resolve conflicts
-git add .
-git commit -m "message"
prepare to push to github
change wd out of the project repo
create the repo on github
copy the repo URL from the ‘code’ button
-git clone
-git push

Commands

git init

convert an existing, unversioned project to a Git repository or initialize a new empty repository.

git clone

copies an existing Git repository

git status

show status of all tracked and untracked files

git log

full history of commits
--stat includes a brief summary
--oneline condenses each commit to a single line
--graph --decorate includes branches and the ref names

git reflog

get complete history in repo

git add

starts tracking
git add . (adds all files)

git commit

commit changes
-m "message"
-a stages tracked files (does the add) as well as committing

git show

see changes that were made to tracked files
git show to see textual differences
git show HEAD references the currently checked-out commit/branch.
git show HEAD~2 shows the commit two before HEAD

git checkout -- <file>

discard changes in the working directory
git checkout HEAD~2 checks out the commit two before HEAD (detached head)

git diff

To see what you’ve changed but not yet staged

git restore --staged <file>

to unstage

git reset

permanently resets the code to the given commit hash

Config

git config --global user.name "name"
git config --global user.email "email"